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Abstract 

We propose an efficient method for approximating natural gradient descent in neural net¬ 
works which we call Kronecker-factored Approximate Curvature (K-FAC). K-FAC is based on 
an efficiently invertible approximation of a neural network’s Fisher information matrix which 
is neither diagonal nor low-rank, and in some cases is completely non-sparse. It is derived by 
approximating various large blocks of the Fisher (corresponding to entire layers) as being the 
Kronecker product of two much smaller matrices. While only several times more expensive to 
compute than the plain stochastic gradient, the updates produced by K-FAC make much more 
progress optimizing the objective, which results in an algorithm that can be much faster than 
stochastic gradient descent with momentum in practice. And unlike some previously proposed 
approximate natural-gradient/Newton methods which use high-quality non-diagonal curvature 
matrices (such as Hessian-free optimization), K-FAC works very well in highly stochastic op¬ 
timization regimes. This is because the cost of storing and inverting K-FAC’s approximation 
to the curvature matrix does not depend on the amount of data used to estimate it, which is 
a feature typically associated only with diagonal or low-rank approximations to the curvature 
matrix. 


1 Introduction 


The problem of training neural networks is one of the most important and highly investigated 
ones in maehine learning. Despite work on layer-wise pretraining sehemes, and various sophisti- 
eated optimization methods whieh try to approximate Newton-Raphson updates or natural gradi¬ 
ent updates, stoehastie gradient deseent (SGD), possibly augmented with momentum, remains the 
method of ehoiee for large-seale neural network training (|Sutskever et al.[[2013[). 


From the work on Hessian-free optimization (HF) (Martens 20101 and related methods (e.g. 
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Vinyals and Povey} 2012) we know that updates eomputed using loeal eurvature information ean 
make mueh more progress per iteration than the sealed gradient. The reason that HF sees fewer 
praetieal applieations than SGD are twofold. Firstly, its updates are mueh more expensive to eom- 
pute, as they involve running linear eonjugate gradient (CG) for potentially hundreds of iterations, 
eaeh of whieh requires a matrix-veetor produet with the eurvature matrix (whieh are as expen¬ 
sive to eompute as the stoehastie gradient on the eurrent mini-bateh). Seeondly, HF’s estimate 
of the eurvature matrix must remain fixed while CG iterates, and thus the method is able to go 
through mueh less data than SGD ean in a eomparable amount of time, making it less well suited 
to stoehastie optimizations. 


As diseussed in [Martens and Sutskever] ( |2012| ) and ||Sutskever et al^ ( |2013| ), CG has the po¬ 
tential to be mueh faster at loeal optimization than gradient deseent, when applied to quadratie 
objeetive funetions. Thus, insofar as the objeetive ean be loeally approximated by a quadratie, 
eaeh step of CG eould potentially be doing a lot more work than eaeh iteration of SGD, whieh 
would result in HF being mueh faster overall than SGD. However, there are examples of quadratie 
funetions (e.g. ^ 2005| ), eharaeterized by eurvature matriees with highly spread-out eigenvalue 
distributions, where CG will have no distinet advantage over well-tuned gradient deseent with 
momentum. Thus, insofar as the quadratie funetions being optimized by CG within HF are of 
this eharaeter, HF shouldn’t in prineiple be faster than well-tuned SGD with momentum. The ex¬ 
tent to whieh neural network objeetive funetions give rise to sueh quadraties is unelear, although 


Sutskever et al. (2013) provides some preliminary evidenee that they do. 


CG falls vietim to this worst-ease analysis beeause it is a first-order method. This motivates 
us to eonsider methods whieh don’t rely on first-order methods like CG as their primary engines of 
optimization. One sueh elass of methods whieh have been widely studied are those whieh work by 
direetly inverting a diagonal, bloek-diagonal, or low-rank approximation to the eurvature matrix 
(e.g. [Beeker and LeCunj |1989[ jSehaul et al.j , |2013[ |Zeiler[ |2()13[ |Le Roux et al. 


2008 Ollivier 


20131. In faet, a diagonal approximation of the Fisher information matrix is used within HF as a 
preeonditioner for CG. However, these methods provide only a limited performanee improvement 


in praetiee, espeeially eompared to SGD with momentum (see for example Sehraudolph et al. 


2007 Zeiler[ 2013), and many praetitioners tend to forgo them in favor of SGD or SGD with 
momentum. 


We know that the eurvature assoeiated with neural network objeetive funetions is highly non¬ 
diagonal, and that updates whieh properly respeet and aeeount for this non-diagonal eurvature, 
sueh as those generated by HF, ean make mueh more progress minimizing the objeetive than the 
plain gradient or updates eomputed from diagonal approximations of the eurvature (usually ~10^ 
HF updates are required to adequately minimize most objeetives, eompared to the - 10® 
required by methods that use diagonal approximations). Thus, if we had an effieient and direet 
way to eompute the inverse of a high-quality non-diagonal approximation to the eurvature matrix 
(i.e. without relying on first-order methods like CG) this eould potentially yield an optimization 
method whose updates would be large and powerful like HF’s, while being (almost) as eheap to 
eompute as the stoehastie gradient. 
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In this work we develop sueh a method, whieh we eall Kroneeker-faetored Approximate Cur¬ 
vature (K-FAC). We show that our method ean be mueh faster in praetiee than even highly tuned 
implementations of SGD with momentum on eertain standard neural network optimization beneh- 
marks. 


The main ingredient in K-FAC is a sophistieated approximation to the Fisher information 
matrix, whieh despite being neither diagonal nor low-rank, nor even bloek-diagonal with small 
bloeks, ean be inverted very effieiently, and ean be estimated in an online fashion using arbitrarily 
large subsets of the training data (without inereasing the eost of inversion). 


This approximation is built in two stages. In the first, the rows and eolumns of the Fisher are 
divided into groups, eaeh of whieh eorresponds to all the weights in a given layer, and this gives 
rise to a bloek-partitioning of the matrix (where the bloeks are much larger than those used by 


Roux et al. (20081 or Ollivier (20131). These bloeks are then approximated as Kroneeker produets 


between mueh smaller matriees, whieh we show is equivalent to making eertain approximating 
assumptions regarding the statisties of the network’s gradients. 


In the seeond stage, this matrix is further approximated as having an inverse whieh is either 
bloek-diagonal or bloek-tridiagonal. We justify this approximation through a eareful examination 
of the relationships between inverse eovarianees, tree-struetured graphieal models, and linear re¬ 
gression. Notably, this justifieation doesn’t apply to the Fisher itself, and our experiments confirm 
that while the inverse Fisher does indeed possess this structure (approximately), the Fisher itself 
does not. 


The rest of this paper is organized as follows. Section gives basic background and notation 
for neural networks and the natural gradient. Section describes our initial Kroneeker product 
approximation to the Fisher. Section [^describes our further block-diagonal and bloek-tridiagonal 
approximations of the inverse Fisher, and how these can be used to derive an efficient inversion 
algorithm. Section [^describes how we compute online estimates of the quantities required by our 
inverse Fisher approximation over a large “window” of previously processed mini-batches (which 
makes K-FAC very different from methods like HF or KSD, which base their estimates of the 
curvature on a single mini-batch). Section|^describes how we use our approximate Fisher to obtain 
a practical and robust optimization algorithm which requires very little manual tuning, through the 
careful application of various theoretically well-founded “damping” techniques that are standard in 
the optimization literature. Note that damping techniques compensate both for the local quadratic 
approximation being implicitly made to the objective, and for our further approximation of the 
Fisher, and are non-optional for essentially any 2nd-order method like K-FAC to work properly. 


as is well established by both theory and practice within the optimization literature (Nocedal and 


Wright 2006|). Section [^describes a simple and effective way of adding a type of “momentum” to 


K-FAC, which we have found works very well in practice. Section [^describes the computational 
costs associated with K-FAC, and various ways to reduce them to the point where each update 
is at most only several times more expensive to compute than the stochastic gradient. Section 
1^ gives complete high-level pseudocode for K-FAC. Section 10 characterizes a broad class of 
network transformations and reparameterizations to which K-FAC is essentially invariant. Section 
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[^considers some related prior methods for neural network optimization. Proofs of formal results 
are loeated in the appendix. 


2 Background and notation 

2.1 Neural Networks 


In this seetion we will define the basie notation for feed-forward neural networks whieh we will 


use throughout this paper. Note that this presentation elosely follows the one from Martens (2014). 


A neural network transforms its input oq = x to an output f{x,9) = ai through a series 
of i layers, eaeh of whieh eonsists of a bank of units/neurons. The units eaeh receive as input 
a weighted sum of the outputs of units from the previous layer and compute their output via a 
nonlinear “activation” function. We denote by Si the vector of these weighted sums for the Ath 
layer, and by at the vector of unit outputs (aka “activities”). The precise computation performed at 
each layer i G is given as follows: 


Si = Witti^i 


where 0, is an element-wise nonlinear function, Wi is a weight matrix, and ai is defined as the 
vector formed by appending to a* an additional homogeneous coordinate with value 1. Note that 
we do not include explicit bias parameters here as these are captured implicitly through our use of 
homogeneous coordinates. In particular, the last column of each weight matrix Wi corresponds to 
what is usually thought of as the “bias vector”. Figure [^illustrates our definition for i = 2. 

We will define 6 = [vec(IFi)^ vec(IF 2 )^ • • • vec(lF£)^]^, which is the vector consisting of 
all of the network’s parameters concatenated together, where vec is the operator which vectorizes 
matrices by stacking their columns together. 

We let L{y, z) denote the loss function which measures the disagreement between a prediction 
z made by the network, and a target y. The training objective function h{0) is the average (or 
expectation) of losses L{y, /(x, 9)) with respect to a training distribution over input-target 
pairs (x, y). h{9) is a proxy for the objective which we actually care about but don’t have access 
to, which is the expectation of the loss taken with respect to the true data distribution Qx,y 

We will assume that the loss is given by the negative log probability associated with a simple 
predictive distribution Ry\z for U parameterized by i.e. that we have 

L{,y,z) = - \ogr{y\z) 


where r is Ry\zS density function. This is the case for both the standard least-squares and cross¬ 
entropy objective functions, where the predictive distributions are multivariate normal and multi¬ 
nomial, respectively. 
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We will let Py\x{d) = Ry\f{x,e) denote the eonditional distribution defined by the neural net¬ 
work, as parameterized by 9, and p{y\x, 9) = r{y\f{x, 9)) its density funetion. Note that minimiz¬ 
ing the objeetive funetion h(9) ean be seen as maximum likelihood learning of the model Pyix(9). 

For eonvenienee we will define the following additional notation: 


Vv 


dL(y,f(x,9)) 

dv 


dlogp(yjx,9) 

dv 


and Pi = Vsi 


Algorithmic shows how to eompute the gradient V9 of the loss funetion of a neural network 
using standard baekpropagation. 


Algorithm 1 An algorithm for eomputing the gradient of the loss L{y, f{x, 9)) for a given {x, y). 
Note that we are assuming here for simplieity that the (pi are defined as eoordinate-wise funetions. 

input: ao = x;9 mapped to {Wi, W 2 , ■ ■ ■, We). 

/* Forward pass */ 

for all i from 1 to £ do 

Si <— Widi-i 

o-i ^ <Pi{Si) 

end for 


/* Loss derivative eomputation */ 

dL{y,z) 


PcLf i — 


dz 


z=ai 


/* Baekwards pass */ 

for all i from i downto 1 do 

Pi ^ VaiQcp'iisi) 

PWi ■(— gidj_i 
Ptti-i ■<— Qi 

end for 

output: V9 = [weciVW^y vec(r>lL 2 )^ ... ^eciVWiYY 
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Figure 1: A depiction of a standard feed-forward neural network for £ = 2. 


2.2 The Natural Gradient 


Because our network defines a conditional model Py\x{9), it has an associated Fisher information 
matrix (which we will simply call “the Fisher”) which is given by 


F = E 


dlogp{y\x, 9) d\ogp{y\x, 9) 


d9 


d9 


= E[V9V9 


Ti 


Here, the expectation is taken with respect to the data distribution over inputs x, and the model’s 
predictive distribution Py\xi9) over y. Since we usually don’t have access to Qx, and the above 
expectation would likely be intractable even if we did, we will instead compute F using the training 
distribution Qx over inputs x. 


The well-known natural gradient (Amari, 19981 is defined as F ^Vh{9). Motivated from the 


perspective of information geometry ( Amari and Nagao^ 20001, the natural gradient defines the 
direction in parameter space which gives the largest change in the objective per unit of change in 
the model, as measured by the KL-divergence. This is to be contrasted with the standard gradient, 
which can be defined as the direction in parameter space which gives the largest change in the 
objective per unit of change in the parameters, as measured by the standard Euclidean metric. 

The natural gradient also has links to several classical ideas from optimization. It can be 
shown ( Martens! 2014[ Pascanu and BengT^|2014 ) that the Fisher is equivalent to the General¬ 
ized Gauss-Newton matrix (GGN) (Schraudolph 2002 Martens and Sutskever[ 20121 in certain 
important cases, which is a well-known positive semi-definite approximation to the Hessian of the 
objective function. In particular, ([Martens 20141 showed that when the GGN is defined so that the 
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network is linearized up to the loss funetion, and the loss funetion eorresponds to the negative log 
probability of observations under an exponential family model Ry\z with 2 ; representing the natural 
parameters, then the Fisher eorresponds exaetly to the GGNj^ 


The GGN has served as the eurvature matrix of ehoiee in HF and related methods, and so in 
light of its equivalenee to the Fisher, these 2nd-order methods ean be seen as approximate natural 
gradient methods. And perhaps more importantly from a praetieal perspeetive, natural gradient- 
based optimization methods ean eonversely be viewed as 2nd-order optimization methods, whieh 


as pointed out by Martens (2014)), brings to bare the vast wisdom that has aeeumulated about how 
to make sueh methods work well in both theory and praetiee (e.g Noeedal and Wrigh^ 2006). In 
Seetion we produetively make use of these eonneetions in order to design a robust and highly 
effeetive optimization method using our approximation to the natural gradient/Fisher (whieh is 
developed in Seetionsj^andQ. 


For some good reeent diseussion and analysis of the natural gradient, see Arnold et al. (20111; 


Martens (2014); Paseanu and Bengio (2014). 


3 A block-wise Kronecker-factored Fisher approximation 


The main eomputational ehallenge assoeiated with using the natural gradient is eomputing F~^ (or 
its produet with V/i). For large networks, with potentially millions of parameters, eomputing this 
inverse naively is eomputationally impraetieal. In this seetion we develop an initial approximation 
of F whieh will be a key ingredient in deriving our effieiently eomputable approximation to F~^ 
and the natural gradient. 

Note that V6 = [vec('DkFi)^ vec{VW 2 y ■ ■ ■ vec{VWey]^ and so F ean be expressed as 

F = E [veve'^] 

= E [[vec(F>iyi)^ vec(F>iy2)^ ••• yec{VWey]^[yec{VWiy yec{VW2y yec{VWey]] 


E 

vec(F>iyi) vec(F>lEi)^' 

E 

yeciVWi) yec(VW 2 y' 

... E 

yec{VWi)yec{VWgy' 

E 

vec(PlE2) vec(F>lEi)^' 

E 

yec{VW2) yeciVW^y 

... E 

yec\vW2) yeciVWgY' 


E [yeciVWe) vec(F>lEi)^] E [vec(F>lE^) vec(F>lE 2 )^] • ■ ■ E [yec(VWt) yecipWeV] 


Thus, we see that F ean be viewed as an 

by Fij = E [yeciVWi) yeciVWjY]. 


' by I bloek matrix, with the (z, j)-th bloek Fij given 


'Note that the condition that z represents the natural parameters might require one to formally include the nonlinear 
transformation usually performed by the hnal nonlinearity (/)£ of the network (such as the logistic-sigmoid transform 
before a cross-entropy error) as part of the loss function L instead. Equivalently, one could linearize the network only 
up to the input sg, to (pg when computing the GGN (see Martens and Sutskever (2012[l). 
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Noting that VWi = gia~l_i and that Yec{uv^) = v ® u we have vec{VWi) = vec{giaj_^) = 
tti-i ® gi, and thus we ean rewrite Fij as 

Fij = E [vec{VWi) veciVWj^] = E [(di_i 0 ® 9jV] = E [(di-i <8 ® dj )] 

= E ® gtgj] 


where A® B denotes the Kroneeker produet between A G and B, and is given by 


A^B 


MuB 




[A]l,nB 

[A]m,nB 


Note that the Kroneeker produet satisfies many eonvenient properties that we will make use of in 
this paper, espeeially the identity {A 0 B)~^ = A~^ (g) B~^. See Van Loan (2000) for a good 
diseussion of the Kroneeker produet. 


Our initial approximation F to F will be defined by the following bloek-wise approximation: 


Fij = E O gigj] E [ai_iaj_^] (g) E [gigj] = A-ij-i ® Gij = Fij (1) 

where Aij = E [djdj] and Gij = E [gigj] ■ 

This gives 



^0,0 ® Gi,i 

^0,1 C) Gi,2 

Ao/-i ® Gi/ 


Aifi ® G2,i 

^1,1 C) (^2,2 


F = 


^£-1,0 ® Gi^i 

^£-1,1 ® Gi^2 ■ ■ 

0 Gi^i 


whieh has the form of what is known as a Khatri-Rao produet in multivariate statisties. 

The expeetation of a Kroneeker produet is, in general, not equal to the Kroneeker produet of 
expeetations, and so this is indeed a major approximation to make, and one whieh likely won’t be- 
eome exaet under any realistie set of assumptions, or as a limiting ease in some kind of asymptotie 
analysis. Nevertheless, it seems to be fairly aeeurate in praetiee, and is able to suceessfully eapture 
the “eoarse strueture” of the Fisher, as demonstrated in Figure for an example network. 

As we will see in later seetions, this approximation leads to signifieant eomputational savings 
in terms of storage and inversion, whieh we will be able to leverage in order to design an effieient 
algorithm for eomputing an approximation to the natural gradient. 


3.1 Interpretations of this approximation 

Consider an arbitrary pair of weights [Wi\k-^^k 2 and [Wj]k 3 ,k 4 , from the network, where [-jij denotes 
the value of the (i, j)-th entry. We have that the eorresponding derivatives of these weights are 
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Figure 2: A comparison of the exact Fisher F and our block-wise Kronecker-factored approximation F, 
for the middle 4 layers of a standard deep neural network partially trained to classify a 16x16 down-scaled 
version of MNIST. The network was trained with 7 iterations of K-FAC in batch mode, achieving 5% 
error (the error reached 0% after 22 iterations) . The network architecture is 256-20-20-20-20-20-10 and 
uses standard tanh units. On the left is the exact Fisher F, in the middle is our approximation F, and on 
the right is the difference of these. The dashed lines delineate the blocks. Note that for the purposes of 
visibility we plot the absolute values of the entries, with the white level corresponding linearly to the size of 
these values (up to some maximum, which is the same in each image). 

given by V[Wi\ki,k 2 = and T^[Wj]k 3 ,k 4 = where we denote for eonvenienee = 

[dj_i]fc,, = [dj_i]fc 3 , = [gi]k 2 , and = [gj\k4- 

The approximation given by eqn. [^is equivalent to making the following approximation for 
each pair of weights: 

E [VIW,],,4MW,Wm] = E [(s(‘)9<‘l)(aP)gP))] = E [al‘>g<^l » E [g'-lgP)] E [gdgP)] 

( 2 ) 

And thus one way to interpret the approximation in eqn.[2is that we are assuming statistical inde¬ 
pendence between products of unit activities and products of unit input derivatives. 

Another more detailed interpretation of the approximation emerges by considering the follow¬ 
ing expression for the approximation error E — E E (which is 

derived in the appendix): 

g^^\ g^^^) + g^^\ g^^\ g^'^'^) (3) 

Here /?(■) denotes the cumulant of its arguments. Cumulants are a natural generalization of 
the concept of mean and variance to higher orders, and indeed Ist-order cumulants are means 
and 2nd-order cumulants are covariances. Intuitively, cumulants of order k measure the degree to 
which the interaction between variables is intrinsically of order k, as opposed to arising from many 
lower-order interactions. 

A basic upper bound for the approximation error is 
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which will be small if all of the higher-order cumulants are small (i.e. those of order 3 or higher). 
Note that in principle this upper bound may be loose due to possible cancellations between the 
terms in eqn. 

Because higher-order cumulants are zero for variables jointly distributed according to a mul¬ 
tivariate Gaussian, it follows that this upper bound on the approximation error will be small insofar 
as the joint distribution over and is well approximated by a multivariate Gaus¬ 

sian. And while we are not aware of an argument for why this should be the case in practice, it 
does seem to be the case that for the example network from Figure the size of the error is well 
predicted by the size of the higher-order cumulants. In particular, the total approximation error, 
summed over all pairs of weights in the middle 4 layers, is 2894.4, and is of roughly the same size 
as the corresponding upper bound (4134.6), whose size is tied to that of the higher order cumulants 
(due to the impossibility of cancellations in eqn.|^. 

4 Additional approximations to F and inverse computations 

To the best of our knowledge there is no efficient general method for inverting a Khatri-Rao product 
like F. Thus, we must make further approximations if we hope to obtain an efficiently computable 
approximation of the inverse Fisher. 

In the following subsections we argue that the inverse of F can be reasonably approximated 
as having one of two special structures, either of which make it efficiently computable. The second 
of these will be slightly less restrictive than the first (and hence a better approximation) at the 
cost of some additional complexity. We will then show how matrix-vector products with these 
approximate inverses can be efficiently computed, which will thus give an efficient algorithm for 
computing an approximation to the natural gradient. 


4.1 Structured inverses and the connection to linear regression 


Suppose we are given a multivariate distribution whose associated covariance matrix is S. 


Define the matrix B so that for i ^ j, [B]ij is the coefficient on the j-th variable in the optimal 
linear predictor of the i-th variable from all the other variables, and for i = j, [B]ij = 0. Then 
define the matrix D to be the diagonal matrix where is the variance of the error associated 
with such a predictor of the i-th variable. 


Pourahmadi (2011) showed that B and D can be obtained from the inverse covariance S 


-1 


by the formulas 


[B] 


ij 




and [D]i^i 


1 
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from which it follows that the inverse eovarianee matrix ean be expressed as 


S-i = D-\I -B) 


Intuitively, this result says that eaeh row of the inverse eovarianee is given by the eoeffi- 
eients of the optimal linear predietor of the i-th variable from the others, up to a sealing faetor. So 
if the j-th variable is mueh less “useful” than the other variables for predieting the i-th variable, 
we ean expeet that the {i, j)-th entry of the inverse eovarianee will be relatively small. 

Note that “usefulness” is a subtle property as we have informally defined it. In partieular, 
it is not equivalent to the degree of eorrelation between the j-th and i-th variables, or any sueh 
simple measure. As a simple example, eonsider the ease where the j-th variable is equal to the 
k-th variable plus independent Gaussian noise. Sinee any linear predietor ean aehieve a lower 
varianee simply by shifting weight from the j-th variable to the k-th variable, we have that the j-th 
variable is not useful (and its eoeffieient will thus be zero) in the task of predieting the i-th variable 
for any setting of i other than i = j or i = k. 

Noting that the Fisher F is a eovarianee matrix over VO w.r.t. the model’s distribution (beeause 
'^[DO] = 0 by Lemma 1^, we ean thus apply the above analysis to the distribution over V6 to gain 
insight into the approximate strueture of and by extension its approximation 

Consider the derivative VWi of the loss with respeet to the weights Wi of layer i. Intuitively, 
if we are trying to prediet one of the entries of VWi from the other entries of V6, those entries 
also in VWi will likely be the most useful in this regard. Thus, it stands to reason that the largest 
entries of F~^ will be those on the diagonal bloeks, so that F~^ will be well approximated as 
bloek-diagonal, with eaeh bloek eorresponding to a different VWi. 


Beyond the other entries of VWi, h is the entries of VWi+i and VWi-i (i.e. those assoeiated 
with adjaeent layers) that will arguably be the most useful in predieting a given entry of VWi. 
This is beeause the true proeess for eomputing the loss gradient only uses information from the 
layer below (during the forward pass) and from the layer above (during the baekwards pass). Thus, 
approximating F~^ as bloek-tridiagonal seems like a reasonable and milder alternative than taking 
it to be bloek-diagonal. Indeed, this approximation would be exaet if the distribution over VO 
were given by a direeted graphieal model whieh generated eaeh of the VWi?,, one layer at a time, 
from either VWi+i or VWi^i. Or equivalently, if VWi were distributed aeeording to an undireeted 
Gaussian graphieal model with binary potentials only between entries in the same or adjaeent 
layers. Both of these models are depleted in Figure]^ 


Now while in reality the VWi? are generated using information from adjaeent layers aeeord¬ 
ing to a proeess that is neither linear nor Gaussian, it nonetheless stands to reason that their joint 
statisties might be reasonably approximated by sueh a model. In faet, the idea of approximating 
the distribution over loss gradients with a direeted graphieal model forms the basis of the reeent 
FANG method oflGrosse and Salakhutdinov|(|2015|). 


Figure examines the extent to whieh the inverse Fisher is well approximated as bloek- 
diagonal or bloek-tridiagonal for an example network. 
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Figure 3: A comparison of our block-wise Kronecker-factored approximation F, and its inverse, using the 
example neural network from Figure]^ On the left is F, in the middle is its exact inverse, and on the right 
is a 4x4 matrix containing the averages of the absolute values of the entries in each block of the inverse. As 
predicted by our theory, the inverse exhibits an approximate block-tridiagonal structure, whereas F itself 
does not. Note that the corresponding plots for the exact F and its inverse look similar. The very small 
blocks visible on the diagonal of the inverse each correspond to the weights on the outgoing connections of 
a particular unit. The inverse was computed subject to the factored Tikhonov damping technique described 
in Sections |6.3| and |6.6[ using the same value of 7 that was used by K-FAC at the iteration from which this 


example was taken (see Figure]^. Note that for the purposes of visibility we plot the absolute values of the 
entries, with the white level corresponding linearly to the size of these values (up to some maximum, which 
is chosen differently for the Fisher approximation and its inverse, due to the highly differing scales of these 
matrices). 
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In the following two subsections we show how both the block-diagonal and block-tridiagonal 
approximations to F~^ give rise to computationally efficient methods for computing matrix-vector 
products with it. And at the end of Section we present two figures (Figures and which 
examine the quality of these approximations for an example network. 


4.2 Approximating F ^ as block-diagonal 

Approximating F~^ as block-diagonal is equivalent to approximating F as block-diagonal. A 
natural choice for such an approximation F of F, is to take the block-diagonal of F to be that of 
F. This gives the matrix 

F = diag F 2 , 2 , ■ ■ ■, , = diag (Ao,o ® ® ^^ 2 , 2 , • • •, ® 


Using the identity {A® B) ^ = A ^ ® B ^we can easily compute the inverse of F as 
F = diag (^ 0,0 ® ^ 1 , 1 ) ^ 1,1 ® ^ 2 , 2 ) • • •) 


Thus, computing F ^ amounts to computing the inverses of 2£ smaller matrices. 

Then to compute u = F~^v, we can make use of the well-known identity [A ® B) vec(X) = 
\ec{BXA'^) to get 


U; = G7 ViA 


r-i 

i—1,2—1 


where v maps to (Ui, 14,..., I 4 ) and u maps to (Ui, U 2 , ■ ■ ■, Ui) in an analogous way to how 9 
maps to (lUi, W 2 ,..., Wi). 


Note that block-diagonal approximations to the Fisher information have been proposed before 
in TONGA ( Le Roux et al.[ 2008), where each block corresponds to the weights associated with a 
particular unit. In our block-diagonal approximation, the blocks correspond to all the parameters 
in a given layer, and are thus much larger. In fact, they are so large that they would be impractical 
to invert as general matrices. 


4.3 Approximating F ^ as block-tridiagonal 

Note that unlike in the above block-diagonal case, approximating F~^ as block-tridiagonal is not 
equivalent to approximating F as block-tridiagonal. Thus we require a more sophisticated ap¬ 
proach to deal with such an approximation. We develop such an approach in this subsection. 

To start, we will define F to be the matrix which agrees with F on the tridiagonal blocks, and 
which satisfies the property that F~^ is block-tridiagonal. Note that this definition implies certain 
values for the off-tridiagonal blocks of F which will differ from those of F insofar as F~^ is not 
actually block-tridiagonal. 
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Figure 4: A diagram depicting the UGGM corresponding to F~^ and its equivalent DGGM. The UGGM’s 
edges are labeled with the corresponding weights of the model (these are distinct from the network’s 
weights). Here, denotes the (i,j)-th block of F~^. The DGGM’s edges are labeled with the 

matrices that specify the linear mapping from the source node to the conditional mean of the destination 
node (whose conditional covariance is given by its label). 


To establish that such a matrix F is well defined and can be inverted efficiently, we first 
observe that assuming that F~^ is block-tridiagonal is equivalent to assuming that it is the preci¬ 
sion matrix of an undirected Gaussian graphical model (UGGM) over VO (as depicted in Figure 
whose density function is proportional to exp{—'D9~^F~^'D9). As this graphical model has a 
tree structure, there is an equivalent directed graphical model with the same distribution and the 
same (undirected) graphical structure (e.g. Bishop[ 2006| ), where the directionality of the edges is 
given by a directed acyclic graph (DAG). Moreover, this equivalent directed model will also be 
linear/Gaussian, and hence a directed Gaussian Graphical model (DGGM). 


Next we will show how the parameters of such a DGGM corresponding to F can be efficiently 
recovered from the tridiagonal blocks of F, so that F is uniquely determined by these blocks (and 
hence well-defined). We will assume here that the direction of the edges is from the higher layers 
to the lower ones. Note that a different choice for these directions would yield a superficially 
different algorithm for computing the inverse of F that would nonetheless yield the same output. 


For each i, we will denote the conditional covariance matrix of vec(DfFj) on vec(DPFj+i) by 
Sj|j+i and the linear coefficients from vec(DkFj+i) to vec(DkFj) by the matrix so that the 

conditional distributions defining the model are 


vec(DfFi) ~ A/'(4'j_i+ivec(DfUi+i), Sqi+i) and \ec{VW^) ~ A/" (^0, 


Since is just the covariance of vec(DfFf), it is given simply by F^^i = 
f < £ — 1, we can see that is given by 


Fi^i. And for 
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where 


^ti,i = and 

The eonditional eovarianee Sj|j+i is thus given by 








,G 


rGT 


Following the work of Grosse and Salakhutdinov (20151, we use the bloek generalization of 
well-known “Cholesky” deeomposition of the preeision matrix of DGGMs ( Pourahmadi[ 19991, 
whieh gives 


P-^ 




AS 


where, 


A diag (S^| 2 , S 213 , 


-1 


and 


I -T^i ,2 

I ~ 11 ' 2,3 


I 


Thus, matrix-veetor multiplieation with P ^ amounts to performing matrix-veetor multipli- 
eation by S, followed by A, and then by S^. 

As in the bloek-diagonal ease eonsidered in the previous subseetion, matrix-veetor produets 
with S (and S^) ean be effieiently eomputed using the well-known identity {A 0 B)~^ = A~^ 0 
In partieular, u = E~^v ean be eomputed as 

f/. = K - 'tpJi.V.-i'I'ij.-i and [/, = V, 
and similarly u = Ev ean be eomputed as 

f/. = K - 'I'S+iK+i'tS., and U, = V, 
where the UiS and 1 ^’s are defined in terms of u and v as in the previous subseetion. 

Multiplying a veetor u by A amounts to multiplying eaeh vec(14) by the eorresponding 
This is slightly trieky beeause Sj|j+i is the differenee of Kroneeker produets, so we eannot use the 
straightforward identity {A (g) B)~^ = A~^ (g) B~^. Fortunately, there are effieient teehniques for 
inverting sueh matriees whieh we diseuss in detail in Appendix]^ 
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4.4 Examining the approximation quality 


Figures and examine the quality of the approximations F and F of F, which are derived by 
approximating F~^ as block-diagonal and block-tridiagonal (resp.), for an example network. 

From Figure [s which compares F and F directly to F, we can see that while F and F 
exactly capture the diagonal and tridiagonal blocks (resp.) of F, as they must by definition, F ends 
up approximating the off-tridiagonal blocks of F very well too. This is likely owed to the fact that 
the approximating assumption used to derive F, that F~^ is block-tridiagonal, is a very reasonable 
one in practice (judging by Figure]^. 

Figure]^ which compares F~^ and F~^ to F~^, paints an arguably more interesting and 
relevant picture, as the quality of the approximation of the natural gradient will be roughly propor- 
tionaj^to the quality of approximation of the inverse Fisher. We can see from this figure that due 
to the approximate block-diagonal structure of F~^, F~^ is actually a reasonably good approxima¬ 
tion of despite F being a rather poor approximation of F (based on Figure [^. Meanwhile, 
we can see that by accounting for the tri-diagonal blocks, F~^ is indeed a significantly better 
approximation of F~^ than F~^ is, even on the diagonal blocks. 

^The error in any approximation of the natural gradient F~^Vh will be roughly proportional to the error 

in the approximation Fq^ of the associated inverse Fisher F~^, since \\F~^\7h — Fg"^V/i|| < ||Vh|| — F'g"^|j. 
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Figure 5: A comparison of our block-wise Kronecker-factored approximation F, and its approximations 
F and F (which are based on approximating the inverse F~^ as either block-diagonal or block-tridiagonal, 
respectively), using the example neural network from Figure On the left is F, in the middle its approx¬ 
imation, and on the right is the absolute difference of these. The top row compares to F and the bottom 
row compares to F. While the diagonal blocks of the top right matrix, and the tridiagonal blocks of the 
bottom right matrix are exactly zero due to how F and F (resp.) are constructed, the off-tridiagonal blocks 
of the bottom right matrix, while being very close to zero, are actually non-zero (which is hard to see from 
the plot). Note that for the purposes of visibility we plot the absolute values of the entries, with the white 
level corresponding linearly to the size of these values (up to some maximum, which is the same in each 
image). 
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Figure 6 : A comparison of the exact inverse F~^ of our block-wise Kronecker-factored approximation F, 
and its block-diagonal and block-tridiagonal approximations F~^ and F~^ (resp.), using the example neural 
network from Figure]^ On the left is F~^, in the middle its approximation, and on the right is the absolute 
difference of these. The top row compares to F~^ and the bottom row compares to F~^. The inverse was 
computed subject to the factored Tikhonov damping technique described in Sections 6.3 and 6.6 using the 
same value of 7 that was used by K-FAC at the iteration from which this example was taken (see Figure 
1^. Note that for the purposes of visibility we plot the absolute values of the entries, with the white level 
corresponding linearly to the size of these values (up to some maximum, which is the same in each image). 


5 Estimating the required statistics 


Recall that A, ^ = E [djdj] and Gij = E [gigj] ■ Both approximate Fisher inverses discussed in 
Section 1^ require some subset of these. In particular, the block-diagonal approximation requires 
them for i = j, while the block-tridiagonal approximation requires them for j G {i, i + 1 } (noting 

that AJj = Aj^i and Gjj = GjA- 


Since the a/s don’t depend on y, we can take the expectation E [a^aj] with respect to just 
the training distribution over the inputs x. On the other hand, the gi's do depend on y, and 
so the expectatior ^ E \gig]\ must be taken with respect to both and the network’s predictive 

is important to note this expectation should not be taken with respect to the training/data distribution over y (i.e. 
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distribution Py^^. 


While computing matrix-vector products with the Gij could be done exactly and efficiently 
for a given input x (or small mini-batch of x’s) by adapting the methods of Schraudolph (2002), 
there doesn’t seem to be a sufficiently efficient method for computing the entire matrix itself. 
Indeed, the hardness results of Martens et al. (2012) suggest that this would require, for each 
example x in the mini-batch, work that is asymptotically equivalent to matrix-matrix multiplication 
involving matrices the same size as Gij. While a small constant number of such multiplications is 
arguably an acceptable cost (see Section [^, a number which grows with the size of the mini-batch 
would not be. 


Instead, we will approximate the expectation over yhy a standard Monte-Carlo estimate ob¬ 
tained by sampling y's from the network’s predictive distribution and then rerunning the backwards 
phase of backpropagation (see Algorithmic as if these were the training targets. 

Note that computing/estimating the required AijIGi/s involves computing averages over 
outer products of various d/s from network’s usual forward pass, and gi"s from the modified back¬ 
wards pass (with targets sampled as above). Thus we can compute/estimate these quantities on the 
same input data used to compute the gradient Vh, at the cost of one or more additional backwards 
passes, and a few additional outer-product averages. Fortunately, this turns out to be quite inexpen¬ 
sive, as we have found that just one modified backwards pass is sufficient to obtain a good quality 
estimate in practice, and the required outer-product averages are similar to those already used to 
compute the gradient in the usual backpropagation algorithm. 


In the case of online/stochastic optimization we have found that the best strategy is to maintain 
running estimates of the required Ai/s and Gi/s using a simple exponentially decaying averaging 
scheme. In particular, we take the new running estimate to be the old one weighted by e, plus the 
estimate on the new mini-batch weighted by 1 — e, for some 0 < e < 1. In our experiments we 
used e = min{l — 1/fc, 0.95}, where k is the iteration number. 

Note that the more naive averaging scheme where the estimates from each iteration are given 
equal weight would be inappropriate here. This is because the Ai/s and Gi/s depend on the 
network’s parameters 6, and these will slowly change over time as optimization proceeds, so that 
estimates computed many iterations ago will become stale. 


This kind of exponentially decaying averaging scheme is commonly used in methods involv¬ 
ing diagonal or block-diagonal approximations (with much smaller blocks than ours) to the curva¬ 


ture matrix (e.g. |LeCun et al.[|1998t|Park et al.[[2()()()j|Schaul et al.[|2or3| ). Such schemes have the 
desirable property that they allow the curvature estimate to depend on much more data than can be 

Qy\x or Qy\x)- Using the training/data distribution for y would perhaps give an approximation to a quantity known as 
the “empirical Fisher information matrix”, which lacks the previously discussed equivalence to the Generalized Gauss- 
Newton matrix, and would not be compatible with the theoretical analysis performed in Section |3.1| (in particular. 
Lemma would break down). Moreover, such a choice would not give rise to what is usually thought of as the 
natural gradient, and based on the findings of Martens (2010 1 , would likely perform worse in practice as part of an 
optimization algorithm. See Martens ( |2014[ ) for a more detailed discussion of the empirical Fisher and reasons why it 
may be a poor choice for a curvature matrix compared to the standard Fisher. 
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reasonably processed in a single mini-batch. 

Notably, for methods like HF which deal with the exact Fisher indirectly via matrix-vector 
products, such a scheme would be impossible to implement efficiently, as the exact Fisher matrix 
(or GGN) seemingly cannot be summarized using a compact data structure whose size is indepen¬ 
dent of the amount of data used to estimate it. Indeed, it seems that the only representation of the 
exact Fisher which would be independent of the amount of data used to estimate it would be an 
explicit nxn matrix (which is far too big to be practical). Because of this, HF and related methods 
must base their curvature estimates only on subsets of data that can be reasonably processed all at 
once, which limits their effectiveness in the stochastic optimization regime. 


6 Update damping 


6.1 Background and motivation 


The idealized natural gradient approach is to follow the smooth patlj^in the Riemannian manifold 
(implied by the Fisher information matrix viewed as a metric tensor) that is generated by taking 
a series of infinitesimally small steps (in the original parameter space) in the direction of the nat¬ 
ural gradient (which gets recomputed at each point). While this is clearly impractical as a real 
optimization method, one can take larger steps and still follow these paths approximately. But in 
our experience, to obtain an update which satisfies the minimal requirement of not worsening the 
objective function value, it is often the case that one must make the step size so small that the 
resulting optimization algorithm performs poorly in practice. 


The reason that the natural gradient can only be reliably followed a short distance is that it 
is defined merely as an optimal direction (which trades off improvement in the objective versus 
change in the predictive distribution), and not a discrete update. 


Fortunately, as observed by [Martens (2014), the natural gradient can be understood using a 


more traditional optimization-theoretic perspective which implies how it can be used to generate 
updates that will be useful over larger distances. In particular, when Ry\;^ is an exponential family 


model with 2 ; as its natural parameters (as it will be in our experiments). Martens (2014) showed 
that the Fisher becomes equivalent to the Generalized Gauss-Newton matrix (GGN), which is a 
positive semi-definite approximation of the Hessian of h. Additionally, there is the well-known 
fact that when L{x, f{x, 9)) is the negative log-likelihood function associated with a given (x, y) 
pair (as we are assuming in this work), the Hessian H of h and the Fisher F are closely related in the 
sense H is the expected Hessian of L under the training distribution Qx,y, while F is the expected 
Hessian of L under the model’s distribution (defined by the density p(x, y) = p{y\x)q{x)). 


Which has the interpretation of being a geodesic in the Riemannian manifold from the current predictive distribu¬ 
tion towards the training distribution when using a likelihood or KL-divergence based objective function (see|Martens| 


(2014). 
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From these observations it follows that 


M{5) =-5' F5 + Vh{e) ' 5 + h{e) 


(5) 


ean be viewed as a eonvex approximation of the 2nd-order Taylor series of expansion of h{5 + 9), 
whose minimizer 5* is the (negative) natural gradient —F~^'Vh{9). Note that if we add an £2 or 

71 

“weight-deeay” regularization term to h of the form - Ill'll 2 , then similarly F + rjI ean be viewed as 

an approximation of the Hessian of h, and replaeing F with F + t]I in M(5) yields an approxima¬ 
tion of the 2nd-order Taylor series, whose minimizer is a kind of “regularized” (negative) natural 
gradient —(F + T]I)~^'Vh{9), whieh is what we end up using in praetiee. 

From the interpretation of the natural gradient as the minimizer of M{6), we ean see that it 
fails to be useful as a loeal update only insofar as M{6) fails to be a good loeal approximation to 


h{6 + 9). And so as argued by Martens (2014), it is natural to make use of the various “damping’ 


teehniques that have been developed in the optimization literature for dealing with the breakdowns 
in loeal quadratie approximations that inevitably oeeur during optimization. Notably, this break¬ 
down usually won’t oeeur in the final “loeal eonvergenee” stage of optimization where the funetion 
beeomes well approximated as a eonvex quadratie within a suffieiently large neighborhood of the 
loeal optimum. This is the phase traditionally analyzed in most theoretieal results, and while it is 
important that an optimizer be able to eonverge well in this final phase, it is arguably mueh more 
important from a praetieal standpoint that it behaves sensibly before this phase. 


This initial “exploration phase” ( [Darken and Mood^ 1990) is where damping teehniques 
help in ways that are not apparent from the asymptotie eonvergenee theorems alone, whieh is not 
to say there are not strong mathematieal arguments that support their use (see Noeedal and Wright[ 
2006). In partieular, in the exploration phase it will often still be true that h(9 + 5) is aeeurately 


approximated by a eonvex quadratie locally within some region around 5 = 0, and that therefor 
optimization ean be most effieiently performed by minimizing a sequenee of sueh eonvex quadratie 
approximations within adaptively sized loeal regions. 

Note that well designed damping teehniques, sueh as the ones we will employ, automati- 
eally adapt to the loeal properties of the funetion, and effeetively “turn themselves off’ when the 
quadratie model beeomes a suffieiently aeeurate loeal approximation of h, allowing the optimizer 
to aehieve the desired asymptotie eonvergenee behavior ( |More|[T97^ . 


Sueeessful and theoretieally well-founded damping teehniques inelude Tikhonov damping 
(aka Tikhonov regularization, whieh is elosely eonneeted to the trust-region method) with Levenberg- 


Marquardt style adaptation (More 1978), line-searehes, and trust regions, truneation, ete., all of 
whieh tend to be mueh more effeetive in praetiee than merely applying a learning rate to the update, 
or adding n fixed multiple of the identity to the eurvature matrix. Indeed, a subset of these teeh¬ 
niques was exploited in the work ofjMartens (2010), and primitive versions of them have appeared 


implieitly in older works sueh as Beeker and LeCun (1989), and also in many reeent diagonal 


methods like that of Zeiler (2013), although often without a good understanding of what they are 


doing and why they help. 
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Crucially, more powerful 2nd-order optimizers like HF and K-FAC, whieh have the eapabil- 
ity of taking much larger steps than Ist-order methods (or methods whieh use diagonal eurvature 
matriees), require more sophistieated damping solutions to work well, and will usually completely 
fail without them, whieh is eonsistent with predietions made in various theoretieal analyses (e.g. 

). As an analogy one ean think of sueh powerful 2nd-order optimizers as 
that need more sophistieated eontrol systems than standard ears to pre¬ 
vent them from flying off the road. Arguably one of the reasons why high-powered 2nd-order opti¬ 
mization methods have historieally tended to under-perform in maehine learning applieations, and 
in neural network training in partieular, is that their designers did not understand or take seriously 
the issue of quadratie model approximation quality, and did not employ the more sophistieated and 
effeetive damping teehniques that are available to deal with this issue. 


Noeedal and Wrightl|2006 


extremely fast raeing ears 


For a detailed review and diseussion of various damping teehniques and their erueial role in 
praetieal 2nd-order optimization methods, we refer the reader to [Martens and Sutskever|p012|). 


6.2 A highly effective damping scheme for K-FAC 


Methods like HF whieh use the exaet Fisher seem to work reasonably well with an adaptive 
Tikhonov regularization teehnique where XI is added to F + pi, and where A is adapted aeeord- 
ing to Levenberg-Marquardt style adjustment rule. This eommon and well-studied method ean be 
shown to be equivalent to imposing an adaptive spherieal region (known as a “trust region”) whieh 
eonstrains the optimization of the quadratie model (e.g [Noeedal and Wrightl [2006| ). However, we 
found that this simple teehnique is insuffieient when used with our approximate natural gradient 
update proposals. In partieular, we have found that there never seems to be a “good” ehoiee for A 
that gives rise to updates whieh are of a quality eomparable to those produeed by methods that use 
the exaet Fisher, sueh as HF. 


One possible explanation for this finding is that, unlike quadratie models based on the exaet 
Fisher (or equivalently, the GGN), the one underlying K-FAC has no guarantee of being aeeurate 
up to 2nd-order. Thus, A must remain large in order to eompensate for this intrinsie 2nd-order 
inaeeuraey of the model, whieh has the side effeet of “washing out” the small eigenvalues (whieh 
represent important low-eurvature direetions). 

Fortunately, through trial and error, we were able to find a relatively simple and highly effee¬ 
tive damping seheme, whieh eombines several different teehniques, and whieh works well within 
K-FAC. Our seheme works by eomputing an initial update proposal using a version of the above 
deseribed adaptive Tikhonov damping/regularization method, and then re-sealing this aeeording 
to quadratie model eomputed using the exaet Fisher. This seeond step is made praetieal by the 
faet that it only requires a single matrix-veetor produet with the exaet Fisher, and this ean be eom¬ 
puted effieiently using standard methods. We diseuss the details of this seheme in the following 
subseetions. 
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6.3 A factored Tikhonov regularization technique 


In the first stage of our damping seheme we generate a eandidate update proposal A by applying a 
slightly modified form of Tikhononv damping to our approximate Fisher, before multiplying —Vh 
by its inverse. 

In the usual Tikhonov regularization/damping teehnique, one adds (A + r])I to the eurvature 
matrix (where rj aeeounts for the £2 regularization), whieh is equivalent to adding a term of the 

form —^^11^112 to the eorresponding quadratie model (given by M{S) with F replaeed by our 

approximation). For the bloek-diagonal approximation F of F (from Seetion 4.2) this amounts to 
adding (A + r])I (for a lower dimensional J) to eaeh of the individual diagonal bloeks, whieh gives 
modified diagonal bloeks of the form 


® Gi^i + (A + 7])! — 0 Gi^i + (A + 7])! 0 I 


( 6 ) 


Beeause this is the sum of two Kroneeker produets we eannot use the simple identity {A ®B)~^ = 
A~^ 0 anymore. Fortunately however, there are effieient teehniques for inverting sueh matri- 
ees, whieh we diseuss in detail in Appendix]^ 

If we try to ap ply t his same Tikhonov teehnique to our more sophistieated approximation F 
of F (from Seetion 4.3) by adding (A + 7])I to eaeh of the diagonal bloeks of F, it is no longer 
elear how to effieiently invert F. Instead, a solution whieh we have found works very well in 
praetiee (and whieh we also use for the bloek-diagonal approximation F), is to add TTi{\/X + 7])I 

and —(y/A + 7])I for a sealar eonstant tt* to the individual Kroneeker faetors j_i and Gi^i 

(resp.) of eaeh diagonal bloek, giving 


+ 7rj(\/A + 0 + —(a/A + 7])!^ 


(7) 


As this is a single Kroneeker produet, all of the eomputations deseribed in Seetions 4.2 and 4.3 


ean still be used here too, simply by replaeing eaeh Aj_i j_i and Gi^i with their modified versions 

Ai-i i-i + + 7])I and Gi^i H-(y/A + 7])F 

TTi 

To see why the expression in eqn.|^is a reasonable approximation to eqn.[^ note that expand¬ 
ing it gives 


Ai-i^i-i 0 Gi^i -f A -f T])1 0 Gi^i H- (y/A -f T])Ai-i^i-i 0 J -f (A -I- 7])1 0 I 

whieh differs from eqn. [^by the residual error expression 

-f 7 ])I 0 Gi^i H-(y/A -f 7 ])Ai-i^i-i 0 / 
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While the choice of vr^ = 1 is simple and can sometimes work well in practice, a slightly 
more principled choice can be found by minimizing the obvious upper bound (following from the 
triangle inequality) on the matrix norm of this residual expression, for some matrix norm || ■ ||^. 
This gives 


VTi = 




|/(g)a 




Evaluating this expression can be done efficiently for various common choices of the matrix 
norm || ■ ||„. For example, for a general B we have ||/ ® B\\f = \\B (g) I\\f = \/d||i?||i? where d is 
the height/dimension of I, and also || J ® B \\2 = \\B ® /||2 = ||-B|| 2 - 

In our experience, one of the best and must robust choices for the norm || ■ ||,; is the trace-norm, 
which for PSD matrices is given by the trace. With this choice, the formula for tt* has the following 
simple form: 

_ /tr(y4j_i^j_i)/((ij_i + 1) 

V tr(G,,)/d, 

where di is the dimension (number of units) in layer i. Intuitively, the inner fraction is just the 
average eigenvalue of divided by the average eigenvalue of Gi^i. 

Interestingly, we have found that this factored approximate Tikhonov approach, which was 
originally motivated by computational concerns, often works better than the exact version (eqn.[^ 
in practice. The reasons for this are still somewhat mysterious to us, but it may have to do with the 
fact that the inverse of the product of two quantities is often most robustly estimated as the inverse 
of the product of their individually regularized estimates. 


6.4 Re-scaling according to the exact F 

Given an update proposal A produced by multiplying the negative gradient — V/i by our approxi¬ 
mate Fisher inverse (subject to the Tikhonov technique described in the previous subsection), the 
second stage of our proposed damping scheme re-scales A according to the quadratic model M as 
computed with the exact F, to produce a final update 5 = a A. 

More precisely, we optimize a according to the value of the quadratic model 

2 

M{5) = M{aA) = + (A + + aFh^A + h{e) 

as computed using an estimate of the exact Fisher F (to which we add the £2 regularization - 1 - 
Tikhonov term (A -I- Because this is a 1 -dimensional quadratic minimization problem, the 
formula for the optimal a can be computed very efficiently as 

, -Vh^A -V/i^A 

rv = - = - 

AT(F+(A + r7)/)A ATFA + (A + 77)||A||2 
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To evaluate this formula we use the eurrent stoehastie gradient Vh (i.e. the same one used to 
produee A), and eompute matrix-veetor products with F using the input data from the same mini¬ 
batch. While using a mini-batch to compute F gets away from the idea of basing our estimate of 
the curvature on a long history of data (as we do with our approximate Fisher), it is made slightly 
less objectionable by the fact that we are only using it to estimate a single scalar quantity (A^FA). 
This is to be contrasted with methods like HF which perform a long and careful optimization of 
M{5) using such an estimate of F. 

Because the matrix-vector products with F are only used to compute scalar quantities in K- 
FAC, we can reduce their computational cost by roughly one half (versus standard matrix-vector 
products with F) using a simple trick which is discussed in Appendix]^ 

Intuitively, this second stage of our damping scheme effectively compensates for the intrinsic 
inaccuracy of the approximate quadratic model (based on our approximate Fisher) used to generate 
the initial update proposal A, by essentially falling back on a more accurate quadratic model based 
on the exact Fisher. 

Interestingly, by re-scaling A according to M{6), K-FAC can be viewed as a version of HF 
which uses our approximate Fisher as a preconditioning matrix (instead of the traditional diagonal 
preconditioner), and runs CG for only 1 step, initializing it from 0. This observation suggests run¬ 
ning CG for longer, thus obtaining an algorithm which is even closer to HF (although using a much 
better preconditioner for CG). Indeed, this approach works reasonably well in our experience, but 
suffers from some of the same problems that HF has in the stochastic setting, due its much stronger 
use of the mini-batch-estimated exact F. 

Figurej^demonstrates the effectiveness of this re-scaling technique versus the simpler method 
of just using the raw A as an update proposal. We can see that A, without being re-scaled, is a 
very poor update to 6, and won’t even give any improvement in the objective function unless the 
strength of the factored Tikhonov damping terms is made very large. On the other hand, when the 
update is re-scaled, we can afford to compute A using a much smaller strength for the factored 
Tikhonov damping terms, and overall this yields a much larger and more effective update to 9. 


6.5 Adapting A 


Tikhonov damping can be interpreted as implementing a trust-region constraint on the update 6, 
so that in particular the constraint ||(5|| < r is imposed for some r, where r depends on A and 
the curvature matrix (e.g. Nocedal and Wright[ 20061. While some approaches adjust r and then 
seek to find the matching A, it is often simpler just to adjust A directly, as the precise relationship 
between A and r is complicated, and the curvature matrix is constantly evolving as optimization 
takes place. 

The theoretically well-founded Levenberg-Marquardt style rule used by HF for doing this, 
which we will adopt for K-FAC, is given by 

if p > 3/4 then A ^ cciA 
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Figure 7: A comparison of the effectiveness of the proposed damping scheme, with and without the re¬ 
scaling techniques described in Section |6.4| The network used for this comparison is the one produced 


at iteration 500 by K-FAC (with the block-tridiagonal inverse approximation) on the MNIST autoencoder 
problem described in Section[^ The y-axis is the improvement in the objective function h (i.e. h{d) — h{6+ 
6 )) produced by the update 6 , while the x-axis is the strength constant used in the factored Tikhonov damping 
technique (which is denoted by “ 7 ” as described in Section [63] ). In the legend, “no moment.” indicates that 
the momentum technique developed for K-FAC in Section]^ (which relies on the use of re-scaling) was not 
used. 


where p = 


if p < 1/4 then A ^-A 

Ul 

_ h{e + 5)- h{e) 


M{5) 


is the “reduetion ratio” and 0 < cei < 1 is some deeay eonstant, and 


all quantities are eomputed on the eurrent mini-bateh (and M uses the exaet F). 


Intuitively, this rule tries to make A as small as possible (and henee the implieit trust-region 
as large as possible) while maintaining the property that the quadratie model M{6) remains a good 
local approximation to h (in the sense that it aeeurately prediets the value of h{9 + d) for the 5 
whieh gets ehosen at eaeh iteration). It has the desirable property that as the optimization enters 
the final eonvergenee stage where M beeomes an almost exaet approximation in a suffieiently 
large neighborhood of the loeal minimum, the value of A will go rapidly enough towards 0 that 
it doesn’t interfere with the asymptotie loeal eonvergenee theory enjoyed by 2nd-order methods 
(pordlfTWSl). 


In our experiments we applied this rule every Ti iterations of K-FAC, with oji = (19/20)^^ 
and Ti = 5, from a starting value of A = 150. Note that the optimal value of ui and the starting 
value of A may be applieation dependent, and setting them inappropriately eould signifieantly slow 
down K-FAC in praetiee. 


Computing p ean be done quite effieiently. Note that for the optimal 5, M{5) = and 

h{9) is available from the usual forward pass. The only remaining quantity whieh is needed to 
evaluate p is thus h{6 + 5), whieh will require an additional forward pass. But fortunately, we only 
need to perform this onee every Ti iterations. 
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6.6 Maintaining a separate damping strength for the approximate Fisher 


While the seheme described in the previous sections works reasonably well in most situations, we 
have found that in order to avoid certain failure cases and to be truly robust in a large variety of sit¬ 
uations, the Tikhonov damping strength parameter for the factored Tikhonov technique described 


in Section 6.3 should be maintained and adjusted independently of A. To this end we replace the 
expression a/A + r] in Section O with a separate constant 7 , which we initialize to a/A + 7 but 
which is then adjusted using a different rule, which is described at the end of this section. 

The reasoning behind this modification is as follows. The role of A, according to the Leven- 


berg Marquardt theory (More, 19781, is to be as small as possible while maintaining the property 
that the quadratic model M remains a trust-worthy approximation of the true objective. Mean¬ 
while, 7 ’s role is to ensure that the initial update proposal A is as good an approximation as 
possible to the true optimum of M (as computed using a mini-batch estimate of the exact F), so 
that in particular the re-scaling performed in Section [ 6 ^ is as benign as possible. While one might 
hope that adding the same multiple of the identity to our approximate Fisher as we do to the exact 
F (as it appears in M) would produce the best A in this regard, this isn’t obviously the case. In 
particular, using a larger multiple may help compensate for the approximation we are making to 
the Fisher when computing A, and thus help produce a more “conservative” but ultimately more 
useful initial update proposal A, which is what we observe happens in practice. 


A simple measure of the quality of our choice of 7 is the (negative) value of the quadratic 
model M{5) = M{aA) for the optimally chosen a. To adjust 7 based on this measure (or others 
like it) we use a simple greedy adjustment rule. In particular, every T 2 iterations during the op¬ 
timization we try 3 different values of 7 ( 70 , a; 27 o, and (l/ci; 2 ) 7 o, where 70 is the current value) 
and choose the new 7 to be the best of these, as measured by our quality metric. In our experi¬ 
ments we used T 2 = 20 (which must be a multiple of the constant T 3 as defined in Section [Sj), and 
a;2 = 

We have found that M (5) works well in practice as a measure of the quality of 7 , and has the 
added bonus that it can be computed at essentially no additional cost from the incidental quantities 
already computed when solving for the optimal a. In our initial experiments we found that using 
it gave similar results to those obtained by using other obvious measures for the quality of 7 , such 
as h{9 + (5). 


7 Momentum 


Sutskever et al. (20131 found that momentum (Polyak, 1964[ Plaut et al. 19861 was very helpful 


in the context of stochastic gradient descent optimization of deep neural networks. A version of 
momentum is also present in the original HF method, and it plays an arguably even more important 
role in more “stochastic” versions of HF ([Martens and Sutskever[ [20T^ |Kiros[ |20 1 3|) . 


A natural way of adding momentum to K-FAC, and one which we have found works well 
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in practice, is to take the update to be 5 = aA + where is the final update eomputed 
at the previous iteration, and where a and /x are ehosen to minimize M{5). This allows K-FAC 
to effeetively build up a better solution to the loeal quadratie optimization problem min^ M{6) 
(where M uses the exact F) over many iterations, somewhat similarly to how Matrix Momentum 
( |Searpetta et al4[T999| ) and HF do this (see |Sutskever et al!||2013| ). 

The optimal solution for a and /i ean be eomputed as 



■ATFA + (A + r/)||A||2 

A^F^o + (A + r;) A^^o 


A^F^o + (A + 7]) A^^o 
5o^F<5o + (A + r/)||5o||i 


-1 

Wh^A 




The main eost in evaluating this formula is eomputing the two matrix-veetor produets FA 
and F5o. Fortunately, the teehnique diseussed in Appendix ean be applied here to eompute 
the 4 required sealars at the eost of only two forwards passes (equivalent to the eost of only one 
matrix-veetor produet with F). 


Empirieally we have found that this type of momentum provides substantial aeeeleration in 
regimes where the gradient signal has a low noise to signal ratio, whieh is usually the ease in 
the early to mid stages of stoehastie optimization, but ean also be the ease in later stages if the 
mini-bateh size is made suffieiently large. These findings are eonsistent with predietions made by 
eonvex optimization theory, and with older empirieal work done on neural network optimization 


(LeCun et al.l 1998). 


Notably, beeause the implieit “momentum deeay eonstant” /i in our method is being eomputed 
on the fly, one doesn’t have to worry about setting sehedules for it, or adjusting it via heuristies, as 
one often does in the eontext of SGD. 


Interestingly, if h is a quadratie funetion (so the definition of M{6) remains fixed at eaeh 
iteration) and all quantities are eomputed deterministieally (i.e. without noise), then using this 
type of momentum makes K-FAC equivalent to performing preeonditioned linear CG on M{5), 
with the preeonditioner given by our approximate Fisher. This follows from the faet that linear 
CG ean be interpreted as a momentum method where the learning rate a and momentum deeay 
eoeffieient jj, are ehosen to jointly minimize M (5) at the eurrent iteration. 


8 Computational Costs and Efficiency Improvements 

Let d be the typieal number of units in eaeh layer and m the mini-bateh size. The signifieant 
eomputational tasks required to eompute a single update/iteration of K-FAC, and rough estimates 
of their assoeiated eomputational eosts, are as follows: 

1 . standard forwards and baekwards pass: 2Ciid?m 

2 . eomputation of the gradient Vh on the eurrent mini-bateh using quantities eomputed in 
baekwards pass: C 2 ^d?m 
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3. additional backwards pass with random targets (as described in Section]^: CildPm 

4. updating the estimates of the required Ai^s and Gij’s from quantities computed in the 
forwards pass and the additional randomized backwards pass: 2 C 2 i(Prn 

5. matrix inverses (or SVDs for the block-tridiagonal inverse, as described in Appendix 
required to compute the inverse of the approximate Fisher: C^id^ for the block-diagonal 
inverse, € 4^1 for the block-tridiagonal inverse 

6 . various matrix-matrix products required to compute the matrix-vector product of the approx¬ 
imate inverse with the stochastic gradient: C^id^ for the block-diagonal inverse, C^id^ for 
the block-tridiagonal inverse 

7. matrix-vector products with the exact F on the current mini-batch using the approach in 
Appendix]^ ACiid^m with momentum, 2Ciid?m without momentum 

8 . additional forward pass required to evaluate the reduction ratio p needed to apply the A 
adjustment rule described in Section [6^ Ciid?m every Ti iterations 

Here the Cj are various constants that account for implementation details, and we are assuming the 
use of the naive cubic matrix-matrix multiplication and inversion algorithms when producing the 
cost estimates. Note that it it is hard to assign precise values to the constants, as they very much 
depend on how these various tasks are implemented. 


Note that most of the computations required for these tasks will be sped up greatly by perform¬ 
ing them in parallel across units, layers, training cases, or all of these. The above cost estimates 
however measure sequential operations, and thus may not accurately reflect the true computation 
times enjoyed by a parallel implementation. In our experiments we used a vectorized implemen¬ 
tation that performed the computations in parallel over units and training cases, although not over 
layers (which is possible for computations that don’t involve a sequential forwards or backwards 
“pass” over the layers). 


Tasks and [^represent the standard stochastic gradient computation. 

The costs of tasks and are similar and slightly smaller than those of tasks and One 
way to significantly reduce them is to use a random subset of the current mini-batch of size Tim to 
update the estimates of the required Aj j’s and Gi/s. One can similarly reduce the cost of task|^by 
computing the (factored) matrix-vector product with F using such a subset of size T 2 m, although 
we recommend proceeding with caution when doing this, as using inconsistent sets of data for the 
quadratic and linear terms in M(5) can hypothetically cause instability problems which are avoided 
by using consistent data (see [Martens and Sutskever| ( |2012| ), Section 13.1). In our experiments in 
Section[^we used ti = 1/8 and T 2 = 1/4, which seemed to have a negligible effect on the quality 
of the resultant updates, while significantly reducing per-iteration computation time. In a separate 
set of unreported experiments we found that in certain situations, such as when £2 regularization 
isn’t used and the network starts heavily overfitting the data, or when smaller mini-batches were 
used, we had to revert to using r 2 = 1 to prevent significant deterioration in the quality of the 
updates. 
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The cost of taskj^can be made relatively insignificant by making the adjustment period Ti for 
A large enough. We used Ti = 5 in our experiments. 

The costs of tasks and are hard to compare directly with the costs associated with comput¬ 
ing the gradient, as their relative sizes will depend on factors such as the architecture of the neural 
network being trained, as well as the particulars of the implementation. However, one quick obser¬ 
vation we can make is that both tasks and [^involve computations that be performed in parallel 
across the different layers, which is to be contrasted with many of the other tasks which require 
sequential passes over the layers of the network. 


Clearly, if m S> d, then the cost of tasks and [^becomes negligible in comparison to the oth¬ 
ers. However, it is more often the case that m is comparable or perhaps smaller than d. Moreover, 
while algorithms for inverses and SVDs tend to have the same asymptotic cost as matrix-matrix 
multiplication, they are at least several times more expensive in practice, in addition to being 
harder to parallelize on modem GPU architectures (indeed, CPU implementations are often faster 
in our experience). Thus, C 3 and will typically be (much) larger than C 5 and Cq, and so in a 
basic/naive implementation of K-FAC, taskcan dominate the overall per-iteration cost. 


Fortunately, there are several possible ways to mitigate the cost of task As mentioned 
above, one way is to perform the computations for each layer in parallel, and even simultaneously 
with the gradient computation and other tasks. In the case of our block-tridiagonal approximation 
to the inverse, one can avoid computing any SVDs or matrix square roots by using an iterative 
Stein-equation solver (see Appendix |^. And there are also ways of reducing matrix-inversion 
(and even matrix square-root) to a short sequence of matrix-matrix multiplications using iterative 
methods (Pan and Schreiber 19911. Furthermore, because the matrices in question only change 
slowly over time, one can consider hot-starting these iterative inversion methods from previous 
solutions. In the extreme case where d is very large, one can also consider using low-rank - 1 - 
diagonal approximations of the A* ^ and Gij matrices maintained online (e.g. using a similar 
strategy as |Le Roux et al.| ( [2008| )) from which inverses and/or SVDs can be more easily computed. 
Although based on our experience such approximations can, in some cases, lead to a substantial 
degradation in the quality of the updates. 


While these ideas work reasonably well in practice, perhaps the simplest method, and the one 
we ended up settling on for our experiments, is to simply recompute the approximate Fisher inverse 
only every T 3 iterations (we used T 3 = 20 in our experiments). As it turns out, the curvature of 
the objective stays relatively stable during optimization, especially in the later stages, and so in our 
experience this strategy results in only a modest decrease in the quality of the updates. 

If m is much smaller than d, the costs associated with taskcan begin to dominate (provided 
T 3 is sufficiently large so that the cost of taskis relatively small). And unlike task|^ taskmust 
be performed at every iteration. While the simplest solution is to increase m (while reaping the 
benefits of a less noisy gradient), in the case of the block-diagonal inverse it turns out that we can 
change the cost of task from C^id^ to C^i.d?m by taking advantage of the low-rank structure of 
the stochastic gradient. The method for doing this is described below. 
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Let Ai and Qi be matrices whose columns are the m hj’s and gi's (resp.) associated with 
the current mini-batch. Let VwA denote the gradient of h with respect to Wi, shaped as a 
matrix (instead of a vector). The estimate of VwA over the mini-batch is given by ^QiAJ_i, 

computing the amounts to computing Ui = 


which is of rank-m. 


From Section 


4.2 


Gj Substituting in our mini-batch estimate of gives 


u. = G-;[ -g.AL 




Direct evaluation of the expression on the right-hand side involves only matrix-matrix multiplica¬ 
tions between matrices of size m x d and d x m (or between those of size d x d and d x m), and 
thus we can reduce the cost of taskj^to CAd'^m. 


Note that the use of standard £2 weight-decay is not compatible with this trick. This is because 
the contribution of the weight-decay term to each VwA is gWi, which will typically not be low- 
rank. Some possible ways around this issue include computing the weight-decay contribution 
r]F~^9 separately and refreshing it only occasionally, or using a different regularization method, 
such as drop-out (Hinton et al.[ 2012|) or weight-magnitude constraints. 


Note that the adjustment technique for 7 described in Section 6.6 requires that, at every T 2 
iterations, we compute 3 different versions of the update for each of 3 candidate values of 7 . In an 
ideal implementation these could be computed in parallel with each other, although in the summary 
analysis below we will assume they are computed serially. 


Summarizing, we have that with all of the various efficiency improvements discussed in this 
section, the average per-iteration computational cost of K-FAC, in terms of serial arithmetic oper¬ 
ations, is 


[(2 + Ti + 2(1 -f Xmom)A + 2/T2)r2 “f l/Ti)Ci -f (1 -f 2Ti)C2\£d?m 

-f (1 -f 2 /T 2 )[(C' 4 /T 3 + C^Xtri + — Xtri)]£d^ -f (1 -f 2 /T 2 )C 5 (l — Xtri)£d'^ min{d, m} 

where Xmom, Xtri £ {0,1} are flag variables indicating whether momentum and the block-tridiagonal 
inverse approximation (resp.) are used. 

Plugging in the values of these various constants that we used in our experiments, for the 
block-diagonal inverse approximation (xtri = 0 ) this becomes 

(3.425Ci -f 1.2bC2)£(f‘rn -f O.OSSC'sf'd^ -f l.lCAd^ min{(i, m} 


and for the block-tridiagonal inverse approximation (xtri = 1 ) 

(3.425Ci + 1.25C2)id^m + (0.055C'4 + l.lCe)id^ 


which is to be compared to the per-iteration cost of SGD, as given by 

(2C'i + C2)id‘^m 
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9 Pseudocode for K-FAC 

Algorithm[^gives high-level pseudoeode for the K-FAC method, with the details of how to perform 
the eomputations required for eaeh major step left to their respeetive seetions. 


Algorithm 2 High-level pseudoeode for K-FAC 


Initialize 6i (e.g. using a good method such as the ones described in Martens (20101 or Glorot and 


Bengio (2010)) 


• Choose initial values of A (err on the side of making it too large) 

• 7 V^ + V 

• A: 1 

while 9k is not satisfactory do 

• Choose a mini-batch size m (e.g. using a fixed value, an adaptive rule, or some predefined schedule) 

• Select a random mini-batch 5' C 5 of training cases of size m 

• Select a random subset 5i C S' of size ri|5'| 

• Select a random subset S 2 C S' of size r 2 |S''| 

• Perform a forward and backward pass on S' to estimate the gradient Vh{9k) (see Algorithm[^ 

• Perform an additional backwards pass on Si using random targets generated from the model’s pre¬ 
dictive distribution (as described in Section]^ 

• Update the estimates of the required A* j’s and Gjj’s using the Uj’s computed in forward pass for Si, 
and the pj’s computed in the additional backwards pass for Si (as described Section]^ 

• Choose a set F of new candidate 7 ’s as described in Section [ 6 ^ (setting F = { 7 } if not adjusting 7 
at this iteration, i.e. if k ^ 0 (mod T 2 ) ) 

for each 7 G F do 

if recomputing the approximate Fisher inverse this iteration (i.e. if A: = 0 (mod T 3 ) or A: < 3) 


then 


Compute the approximate Fisher inverse (using the formulas derived in Section |4~2] or Section 


4.31) from versions of the current Aij ’s and Gij ’s which are modified as per the factored Tikhonov 


damping technique described in Section [63] (but using 7 as described in Section [63] ) 

end if 

• Compute the update proposal A by multiplying current estimate of approximate Fisher inverse by 
the estimate of Vh (using the formulas derived in Section [43] or Section [43] ). For layers with size 
d < m consider using trick described at the end of Section]^ for increased efficiency. 

• Compute the final update 5 from A as described in Section [Mj (or Section]^ if using momentum) 
where the matrix-vector products with F are estimated on S 2 using the Oj’s computed in the forward 
pass 

end for 

• Select the S and the new 7 computing in the above loop that correspond to the lowest value of M{6) 
(see Section [631 ) 

if updating A this iteration (i.e. if A: = 0 (mod Ti)) then 

• Update A with the Levenberg-Marquardt style rule described in Section [63] 

end if 

• 9k+i ^ 9k + 6 

• k k + 1 

end while 
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10 Invariance Properties and the Relationship to Whitening 
and Centering 


When computed with the exact Fisher, the natural gradient specifies a direction in the space of 
predictive distributions which is invariant to the specific way that the model is parameterized. This 
invariance means that the smooth path through distribution space produced by following the natural 
gradient with infinitesimally small steps will be similarly invariant. 


For a practical natural gradient based optimization method which takes large discrete steps 
in the direction of the natural gradient, this invariance of the optimization path will only hold 
approximately. As shown by [Martens (2014), the approximation error will go to zero as the effects 
of damping diminish and the reparameterizing function ( tends to a locally linear function. Note 
that the latter will happen as ( becomes smoother, or the local region containing the update shrinks 
to zero. 


Because K-FAC uses an approximation of the natural gradient, these invariance results are not 


applicable in our case. Fortunately, as was shown by Martens (2014), one can establish invariance 
of an update direction with respect to a given reparameterization of the model by verifying certain 
simple properties of the curvature matrix C used to compute the update. We will use this result 
to show that, under the assumption that damping is absent (or negligible in its affect), K-FAC is 
invariant to a broad and natural class of transformations of the network. 

This class of transformations is given by the following modified network definition (c.f. the 
definition in Section [2T| ): 

a* = SJi.^i(4js5) 


where (pi is the function that computes (pi and then appends a homogeneous coordinate (with value 
1 ), flj and <l)i are arbitrary invertible matrices of the appropriate sizes (except that we assume 
Vt(, = I), dg = ^otto, and where the network’s output is given by f^{x, 9) = aj. Note that because 
Qi multiplies 0j(<l)jsj), it can implement arbitrary translations of the unit activities (pi(^isl) in 
addition to arbitrary linear transformations. Figure illustrates our modified network definition 
for £ = 2 (c.f. Figure [^. 

Here, and going forward, we will add a “f” superscript to any network-dependent quantity in 
order to denote the analogous version of it computed by the transformed network. Note that under 
this identification, the loss derivative formulas for the transformed network are analogous to those 
of the original network, and so our various Fisher approximations are still well defined. 


33 













CL2 


000 



000000000 



0000000 


Figure 8: A depiction of a transformed network for £ = 2. Note that the quantities labeled with dj and Si 
(without “t”) will be equal to the analogous quantities from the default network, provided that 6 = C(^^) 
in Theorem [T] 
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The following theorem deseribes the main teehnieal result of this seetion. 

Theorem 1. There exists an invertible linear function 9 = C(^^) that /^(x, 9t) = f(x,e) = 
cind thus the transformed network can be viewed as a reparameterization of the orig¬ 
inal network by 6\ Moreover, additively updating 9 by 5 = —aF~^Vh or 5 = —aF~^Vh in the 
original network is equivalent to additively updating h"' by 6^ = —or 6^ = — 

(resp.) in the transformed network, in the sense that (^{9"' + 6"^) = 9 + 6. 


This immediately implies the following eorollary whieh eharaeterizes the invarianee of a basie 
version of K-FAC to the given elass of network transformations. 

Corollary 2. The optimization path taken by K-FAC (using either of our Fisher approximations 
F or F) through the space of predictive distributions is the same for the default network as it is 
for the transformed network (where the klfs and $*’5 remain fixed). This assumes the use of an 
equivalent initialization (9 q = C(^o)A a basic version of K-FAC where damping is absent or 
negligible in effect, momentum is not used, and where the learning rates are chosen in a way that 
is independent of the network’s parameterization. 


While this eorollary assumes that the flfs and <f)j’s are fixed, if we relax this assumption so 
that they are allowed to vary smoothly with 9, then ( will be a smooth funetion of 9, and so as 
diseussed in Martens (2014), invarianee of the optimization path will hold approximately in a way 
that depends on the smoothness of ( (whieh measures how quiekly the flj’s and 4>i’s ehange) and 
the size of the update. Moreover, invarianee will hold exaetly in the limit as the learning rate goes 
to 0. 


Note that the network transformations ean be interpreted as replaeing the network’s nonlin¬ 
earity (j)i{si) at eaeh layer i with a “transformed” version So sinee the well-known 

logistie sigmoid and tanh funetions are related to eaeh other by sueh a transformation, an immedi¬ 
ate eonsequenee of Corollary [^is that K-FAC is invariant to the ehoiee of logistie sigmoid vs. tanh 
aetivation funetions (provided that equivalent initializations are used and that the effeet of damping 
is negligible, ete.). 

Also note that beeause the network inputs are also transformed by K-FAC is thus in¬ 
variant to arbitrary affine transformations of the input, whieh ineludes many popular training data 
preproeessing teehniques. 

Many other natural network transformations, sueh as ones whieh “eenter” and normalize unit 
aetivities so that they have mean 0 and varianee 1 ean be deseribed using diagonal ehoiees for the 
klfs and $j’s whieh vary smoothly with 9. In addition to being approximately invariant to sueh 
transformations (or exaetly, in the limit as the step size goes to 0), K-FAC is similarly invariant to 
a more general elass of sueh transformations, sueh as those whieh transform the units so that they 
have a mean of 0, so they are “eentered”, and a covariance matrix of I, so they are “whitened”, 
whieh is a mueh stronger eondition than the varianees of the individual units eaeh being 1. 

In the ease where we use the bloek-diagonal approximation F and eompute updates without 
damping. Theorem affords us an additional elegant interpretation of what K-FAC is doing. In 
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particular, the updates produeed by K-FAC end up being equivalent to those produeed by standard 
gradient descent using a network whieh is transformed so that the unit aetivities and the unit- 
gradients are both eentered and whitened (with respeet to the model’s distribution). This is stated 
formally in the following eorollary. 

Corollary 3. Additively updating 9 by —aF~^Vh in the original network is equivalent to addi- 
tively updating 6*^ by the gradient descent update —(where 9 = C(^^) in Theorem^ in a 
transformed version of the network where the unit activities a| and the unit-gradients g\ are both 
centered and whitened with respect to the model’s distribution. 


11 Related Work 


The Hessian-free optimization method of Martens] ( |2010[ ) uses linear eonjugate gradient (CG) to 
optimize loeal quadratie models of the form of eqn. (subjeet to an adaptive Tikhonov damping 
teehnique) in lieu of direetly solving it using matrix inverses. As diseussed in the introduetion, the 
main advantages of K-FAC over HF are twofold. Firstly, K-FAC uses an effieiently eomputable 
direet solution for the inverse of the eurvature matrix and thus avoids the eostly matrix-veetor 
produets assoeiated with running CG within HF. Seeondly, it ean estimate the eurvature matrix 
from a lot of data by using an online exponentially-deeayed average, as opposed to relatively 
small-sized fixed mini-batehes used by HF. The eost of doing this is of eourse the use of an inexaet 
approximation to the eurvature matrix. 


Le Roux et al. (2008) proposed a neural network optimization method known as TONGA 


based on a bloek-diagonal approximation of the empirical Fisher where eaeh bloek eorresponds to 
the weights assoeiated with a partieular unit. By eontrast, K-FAC uses much larger bloeks, eaeh 
of whieh eorresponds to all the weights within a partieular layer. The matriees whieh are inverted 
in K-FAC are roughly the same size as those whieh are inverted in TONGA, but rather than there 
being one per unit as in TONGA, there are only two per layer. Therefore, K-FAC is signifieantly 
less eomputationally intensive than TONGA, despite using what is arguably a mueh more aeeurate 
approximation to the Fisher. Note that to help mitigate the eost of the many matrix inversions it 
requires, TONGA approximates the bloeks as being low-rank plus a diagonal term, although this 
introduees further approximation error. 


Centering methods work by either modifying the gradient (Sehraudolph 

1998 

) or dynami- 

eally reparameterizing the network itself (|Raiko et al.[ 2012[ Vatanen et al. 

2013; 

Wiesler et al.[ 


2014), so that various unit-wise sealar quantities like the aetivities (the Uj’s) and loeal derivatives 
(the 0'(sj)’s) are 0 on average (i.e. “eentered”), as they appear in the formula for the gradient. Typ- 
ieally, these methods require the introduetion of additional “skip” eonneetions (whieh bypass the 
nonlinearities of a given layer) in order to preserve the expressive power/effieieney of the network 
after these transformations are applied. 


It is argued by Raiko et al. (2012) that the applieation of the eentering transformation makes 
the Fisher of the resulting network eloser to a diagonal matrix, and thus makes its gradient more 
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closely resemble its natural gradient. However, this argument uses the strong approximating as¬ 
sumption that the eorrelations between various network-dependent quantities, sueh as the aetivities 
of different units within a given layer, are zero. In our notation, this would be like assuming that 
the Gi/s are diagonal, and that the Aj j’s are rank-1 plus a diagonal term. Indeed, using sueh 
an approximation within the bloek-diagonal version of K-FAC would yield an algorithm similar 
to standard eentering, although without the need for skip eonneetions (and henee similar to the 
version of eentering proposed by |Wiesler et al.| ( [2()14[ )). 

As shown in Corollary K-FAC ean also be interpreted as using the gradient of a transformed 
network as its update direetion, although one in whieh the gi's and a^’s are both eentered and 
whitened (with respeet to the model’s distribution). Intuitively, it is this whitening whieh aeeounts 
for the eorrelations between aetivities (or baek-propagated gradients) within a given layer. 


Ollivier (2013|) proposed a neural network optimization method whieh uses a bloek-diagonal 


approximation of the Fisher, with the bloeks eorresponding to the ineoming weights (and bias) of 
eaeh unit. This method is similar to TONGA, exeept that it approximates the Fisher instead of the 


empirieal Fisher (see Martens (2014) for a diseussion of the differenee between these). Beeause 


eomputing bloeks of the Fisher is expensive (it requires k baekpropagations, where k is the number 
of output units), this method uses a biased deterministie approximation whieh ean be eomputed 


more effieiently, and is similar in spirit to the deterministie approximation used by LeCun et al. 


(1998). Note that while sueh an approximation eould hypothetieally be used within K-FAC to 


eompute the Gi/s, we have found that our basie unbiased stoehastie approximation works nearly 
as well as the exaet values in praetiee. 


The work most elosely related to ours is that of Heskes (2000), who proposed an approx¬ 
imation of the Fisher of feed-forward neura l networks similar to our Kroneeker-faetored bloek- 
diagonal approximation F from Seetion 4.2 and used it to derive an effieient approximate natural- 
gradient based optimization method by exploiting the identity {A ® B)~^ = A~^ ® B~^. K-FAC 
differs from Heskes’ method in several important ways whieh turn out to be erueial to it working 
well in praetiee. 

In Heskes’ method, update damping is aeeomplished using a basie faetored Tikhonov teeh- 
nique where 7 / is added to eaeh Gi^i and A, j for a fixed parameter 7 > 0 whieh is set by hand. By 
eontrast, K-FAC uses a faetored Tikhonov teehnique where 7 adapted dynamieally as deseribed 


in Seetion 6 . 6 , eombined with a re-sealing teehnique based on a loeal quadratie model eomputed 


using the exaet Fisher (see Seetion [ 6 A] ). Note that the adaptation of 7 is important sinee what eon- 
stitutes a good or even merely aeeeptable value of 7 will ehange signifieantly over the eourse of 
optimization. And the use of our re-sealing teehnique, or something similar to it, is also erueial as 
we have observed empirieally that basie Tikhonov damping is ineapable of produeing high quality 
updates by itself, even when 7 is ehosen optimally at eaeh iteration (see Figure|^of Seetion [ 6 A] ). 

Also, while Heskes’ method eomputes the Gi/s exaetly, K-FAC uses a stoehastie approxima¬ 
tion whieh seales effieiently to neural networks with mueh higher-dimensional outputs (see Seetion 

0 - 
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Other advances we have introduced include the more accurate block-tridiagonal approxima¬ 
tion to the inverse Fisher, a parameter-free type of momentum (see Section]^, online estimation of 
the Gi^i and A* ^ matrices, and various improvements in computational efficiency (see Section [^. 
We have found that each of these additional elements is important in order for K-FAC to work as 
well as it does in various settings. 


Concurrently with this work Povey et al. (20151 has developed a neural network optimization 
method which uses a block-diagonal Kronecker-factored approximation similar to the one from 
Heskes] (2000). This approach differs from K-FAC in numerous ways, including its use of the 


empirical Fisher (which doesn’t work as well as the standard Fisher in our experience - see Section 
[^, and its use of only a basic factored Tikhonov damping technique without adaptive re-scaling 
or any form of momentum. One interesting idea introduced by Povey et al. ( 2015| ) is a particular 
method for maintaining an online low-rank plus diagonal approximation of the factor matrices for 
each block, which allows their inverses to be computed more efficiently (although subject to an 
approximation). While our experiments with similar kinds of methods for maintaining such online 
estimates found that they performed poorly in practice compared to the solution of refreshing the 
inverses only occasionally (see Section [^, the particular one developed by [Povey et'aL] ( |2015| ) 
could potentially still work well, and may be especially useful for networks with very wide layers. 


12 Heskes’ interpretation of the block-diagonal approximation 


Heskes|(2000) discussed an alternative interpretation of the block-diagonal approximation which 


yields some useful insight to complement our own theoretical analysis. In particular, he observed 
that the block-diagonal Fisher approximation F is the curvature matrix corresponding to the fol¬ 
lowing quadratic function which measures the difference between the new parameter value 9' and 
the current value 9: 


0(0 '.«) = 5 E E [(s. - - s')] 

i=l 

Here, s' = W'ai-i, and the s/s and dj’s are determined by 9 and are independent of 9' (which 
determines the VF/’s). 

D{9\ 9) can be interpreted as a reweighted sum of squared changes of each of the s/s. The 
reweighing matrix Gi^i is given by 


Gi,i — E [gigj] — E 


Fp(i) 
a I Si 


where is the network’s predictive distribution as parameterized by s*, and FpH) is its Fisher 

information matrix, and where the expectation is taken w.r.t. the distribution on Sj (as induced 
by the distribution on the network’s input x). Thus, the effect of reweighing by the Gi/s is to 
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(approximately) translate changes in Sj into changes in the predictive distribution over y, although 
using the expected/average Fisher Gi^i = E[Fp(i) ] instead of the more specific Fisher Fp(i) . 

v\Bi v\si 

Interestingly, if one used F (i) instead of Ga in the expression for D{9',9), then D(9',9) 

v\‘>i 

would correspond to a basic layer-wise block-diagonal approximation of F where the blocks are 
computed exactly (i.e. without the Kronecker-factorizing approximation introduced in Section]^. 
Such an approximate Fisher would have the interpretation of being the Hessian w.r.t. 9' of either of 
the measures 


i=l 


'E 


KL P 


>(0 
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h) 


y\si II y\s[ 


or 


i=l 


'E 


KL P 
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(h 


yWi II y\si 


Note that each term in either of these sums is a function measuring an intrinsic quantity (i.e. 
changes in the output distribution), and so overall these are intrinsic measures except insofar as 
they assume that 9 is divided into i independent groups that each parameterize one of the t different 
predictive distributions (which are each conditioned on their respective Oj-i’s). 


It is not clear whether P, with its Kronecker-factorizing structure can similarly be interpreted 
as the Hessian of such a self-evidently intrinsic measure. If it could be, then this would consider¬ 
ably simplify the proof of our Theorem [^(e.g. using the techniques of Arnold et al. (20111). Note 
that D{9',9) itself doesn’t work, as it isn’t obviously intrinsic. Despite this, as shown in Section 


10 both P and our more advanced approximation P produce updates which have strong invariance 


properties. 


13 Experiments 


To investigate the practical performance of K-FAC we applied it to the 3 deep autoencoder opti¬ 


mization problems from Hinton and Salakhutdinov| ( |2006] ), which use the “MNIST”, “CURVES”, 
and “FACES” datasets respectively (see Hinton and Salakhutdinov (|2006 ) for a complete de¬ 


scription of the network architectures and datasets). Due to their high difficulty, performance on 
these problems has become a standard benchmark for neural network optimization methods (e.g. 



As our baseline we used the version of SGD with momentum based on Nesterov’s Accelerated 
Gradient ( [Nesterov]|1983| ) described in jSutskever et al.| ( |2013| ), which was calibrated to work well 
on these particular deep autoencoder problems. Eor each problem we followed the prescription 


given by Sutskever et al. (20131 for determining the learning rate, and the increasing schedule for 
the decay parameter y. We did not compare to methods based on diagonal approximations of the 
curvature matrix, as in our experience such methods tend not perform as well on these kinds of 
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optimization problems as the baseline does (an observation whieh is eonsistent with the findings 
of |Sehraudolph| ( [2()02[ ); |Zeiler| ( |2013| )). 


Our implementation of K-FAC used most of the effieieney improvements deseribed in Seetion 
exeept that all “tasks” were eomputed serially (and thus with better engineering and more hard¬ 
ware, a faster implementation eould likely be obtained). Beeause the mini-bateh size m tended 
to be eomparable to or larger than the typieal/average layer size d, we did not use the teehnique 
deseribed at the end of Seetion for aeeelerating the eomputation of the approximate inverse, as 
this only improves effieieney in the ease where m < d, and will otherwise deerease effieieney. 


Both K-FAC and the baseline were implemented using veetorized MATLAB eode aeeelerated 
with the GPU paekage Jaeket. The eode for K-FAC is available for downloacQ All tests were 
performed on a single eomputer with a 4.4 Ghz 6 eore Intel CPU and an NVidia GTX 580 GPU 
with 3GB of memory. Eaeh method used the same initial parameter setting, whieh was generated 


using the “sparse initialization” teehnique from Martens||(2010) (whieh was also used by Sutskever 


et al. (2013)). 


To help mitigate the detrimental effeet that the noise in the stoehastie gradient has on the 
eonvergenee of the baseline (and to a lesser extent K-FAC as well) we used a exponentially de- 


eayed iterate averaging approaeh based loosely on Polyak averaging (e.g. [Swersky et aL| |20I0| ). 
In partieular, at eaeh iteration we took the “averaged” parameter estimate to be the previous sueh 
estimate, multiplied by plus the new iterate produeed by the optimizer, multiplied by 1 — for 
^ = 0.99. Sinee the training error assoeiated with the optimizer’s eurrent iterate may sometimes 
be lower than the training error assoeiated with the averaged estimate (whieh will often be the ease 
when the mini-bateh size m is very large), we report the minimum of these two quantities. 


To be eonsistent with the numbers given in previous papers we report the reeonstruetion error 
instead of the aetual objeetive funetion value (although these are almost perfeetly eorrelated in our 
experienee). And we report the error on the training set as opposed to the test set, as we are ehiefly 
interested in optimization speed and not the generalization eapabilities of the networks themselves. 


In our first experiment we examined the relationship between the mini-bateh size m and the 
per-iteration rate of progress made by K-FAC and the baseline on the MNIST problem. The results 
from this experiment are plotted in Figure They strongly suggest that the per-iteration rate 
of progress of K-FAC tends to a superlinear funetion of m (whieh ean be most elearly seen by 
examining the plots of training error vs training oases prooessed), whieh is to be oontrasted with 
the baseline, where inoreasing m has a muoh smaller effeet on the per-iteration rate of progress, and 
with K-FAC without momentum, where the per-iteration rate of progress seems to be a linear or 
slightly sublinear funetion of m. It thus appears that the main limiting faetor in the eonvergenee of 
K-FAC (with momentum applied) is the noise in the gradient, at least in later stages of optimization, 
and that this is not true of the baseline to nearly the same extent. This would seem to suggest that 
K-FAC, mueh more than SGD, would benefit from a massively parallel distributed implementation 
whieh makes use of more eomputational resourees than a single GPU. 

“http://WWW.cs.toronto.edu/-jmartens/docs/KFAC3-MATLAB.zip 
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Figure 9: Results from our first experiment examining the relationship between the mini-batch size m and 
the per-iteration progress (left column) or per-training case progress (right column) progress made by K- 
FAC on the MNIST deep autoencoder problem. Here, “Blk-TriDiag K-FAC” is the block-tridiagonal version 
of K-FAC, and “Blk-Diag K-FAC” is the block-diagonal version, and “no moment.” indicates that momen¬ 
tum was not used. The bottom row consists of zoomed-in versions of the right plot from the row above it, 
with the left plot concentrating on the beginning stage of optimization, and the right plot concentrating on 
the later stage. Note that the x-axes of these two last plots are at significantly different scales (10® vs 10^). 
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But even in the single CPU/GPU setting, the fact that the per-iteration rate of progress tends 
to a superlinear function of m, while the per-iteration computational cost of K-FAC is a roughly 
linear function of m, suggests that in order to obtain the best per-second rate of progress with K- 
FAC, we should use a rapidly increasing schedule for m. To this end we designed an exponentially 
increasing schedule for m, given by = min(mi exp((A; — l)/6), 151), where k is the current 
iteration, mi = 1000, and where b is chosen so that msoo = |5|. The approach of increasing the 
mini-batch size in this way is analyzed by Friedlander and Schmidt ( 2012| ). Note that for other 
neural network optimization problems, such as ones involving larger training datasets than these 
autoencoder problems, a more slowly increasing schedule, or one that stops increasing well before 
m reaches |5|, may be more appropriate. One may also consider using an approach similar to that 


of Byrd et al. (2012) for adaptively determining a suitable mini-batch size. 


In our second experiment we evaluated the performance of our implementation of K-FAC 
versus the baseline on all 3 deep autoencoder problems, where we used the above described ex¬ 
ponentially increasing schedule for m for K-FAC, and a fixed setting of m for the baseline and 
momentum-less K-FAC (which was chosen from a small range of candidates to give the best over¬ 
all per-second rate of progress). The relatively high values of m chosen for the baseline (m = 250 
for CURVES, and m = 500 for MNIST and FACES, compared to the m = 200 which was used 


by Sutskever et al. (20131) reflect the fact that our implementation of the baseline uses a high- 
performance GPU and a highly optimized linear algebra package, which allows for many training 
cases to be efficiently processed in parallel. Indeed, after a certain point, making m much smaller 
didn’t result in a significant reduction in the baseline’s per-iteration computation time. 


Note that in order to process the very large mini-batches required for the exponentially in¬ 
creasing schedule without overwhelming the memory of the GPU, we partitioned the mini-batches 
into smaller “chunks” and performed all computations involving the mini-batches, or subsets 
thereof, one chunk at a time. 

The results from this second experiment are plotted in Eigures [T^ and Eor each problem 
K-EAC had a per-iteration rate of progress which was orders of magnitude higher than that of the 
baseline’s (Eigure[^, provided that momentum was used, which translated into an overall much 
higher per-second rate of progress (EigurefTOj), despite the higher cost of K-EAC’s iterations (due 
mostly to the much larger mini-batch sizes used). Note that Polyak averaging didn’t produce a 
significant increase in convergence rate of K-EAC in this second experiment (actually, it hurt a bit) 
as the increasing schedule for m provided a much more effective (although expensive) solution to 
the problem of noise in the gradient. 


The importance of using some form of momentum on these problems is emphasized in these 
experiments by the fact that without the momentum technique developed in Section K-EAC 
wasn’t significantly faster than the baseline (which itself used a strong form of momentum). These 
results echo those of |Sutskever et al.| ( |2013[ ), who found that without momentum, SGD was orders 
of magnitude slower on these particular problems. Indeed, if we had included results for the 
baseline without momentum they wouldn’t even have appeared in the axes boundaries of the plots 
in Eigure[T0| 
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Figure 10: Results from our second experiment showing training error versus computation time on the 
CURVES (top), MNIST (middle), and FACES (bottom) deep autoencoder problems. 
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Figure 11: More results from our second experiment showing training error versus iteration on the 
CURVES (top row), MNIST (middle row), and FACES (bottom row) deep autoencoder problems. The 
plots on the right are zoomed in versions of those on the left which highlight the difference in per-iteration 
progress made by the different versions of K-FAC. 
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Recall that the type of momentum used by K-FAC compensates for the inexactness of our 
approximation to the Fisher by allowing K-FAC to build up a better solution to the exact quadratic 
model minimization problem (defined using the exact Fisher) across many iterations. Thus, if we 
were to use a much stronger approximation to the Fisher when computing our update proposals 
A, the benefit of using this type of momentum would have likely been much smaller than what we 
observed. One might hypothesize that it is the particular type of momentum used by K-FAC that 
is mostly responsible for its advantages over the SGD baseline. However in our testing we found 
that for SGD the more conventional type of momentum used by Sutskever et al. ( 2013| ) performs 
significantly better. 


From Figure[^we can see that the block-tridiagonal version of K-FAC has a per-iteration rate 
of progress which is typically 25% to 40% larger than the simpler block-diagonal version. This 
observation provides empirical support for the idea that the block-tridiagonal approximate inverse 
Fisher F~^ is a more accurate approximation of F~^ than the block-diagonal approximation F~^. 
However, due to the higher cost of the iterations in the block-tridiagonal version, its overall per- 
second rate of progress seems to be only moderately higher than the block-diagonal version’s, 
depending on the problem. 


Note that while matrix-matrix multiplication, matrix inverse, and SVD computation all have 
the same computational complexity, in practice their costs differ significantly (in increasing order 
as listed). Computation of the approximate Fisher inverse, which is performed in our experiments 
once every 20 iterations (and for the first 3 iterations), requires matrix inverses for the block- 
diagonal version, and SVDs for the block-tridiagonal version. For the FACES problem, where the 
layers can have as many as 2000 units, this accounted for a significant portion of the difference 
in the average per-iteration computational cost of the two versions (as these operations must be 
performed on 2000 x 2000 sized matrices). 

While our results suggest that the block-diagonal version is probably the better option overall 
due to its greater simplicity (and comparable per-second progress rate), the situation may be dif¬ 
ferent given a more efficient implementation of K-FAC where the more expensive SVDs required 
by the tri-diagonal version are computed approximately and/or in parallel with the other tasks, or 
perhaps even while the network is being optimized. 

Our results also suggest that K-FAC may be much better suited than the SGD baseline for a 
massively distributed implementation, since it would require far fewer synchronization steps (by 
virtue of the fact that it requires far fewer iterations). 


14 Conclusions and future directions 

In this paper we developed K-FAC, an approximate natural gradient-based optimization method. 
We started by developing an efficiently invertible approximation to a neural network’s Fisher in¬ 
formation matrix, which we justified via a theoretical and empirical examination of the statistics 
of the gradient of a neural network. Then, by exploiting the interpretation of the Fisher as an 
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approximation of the Hessian, we designed a developed a eomplete optimization algorithm using 
quadratie model-based damping/regularization teehniques, whieh yielded a highly effeetive and 
robust method virtually free from the need for hyper-parameter tuning. We showed the K-FAC 
preserves many of natural gradient deseent’s appealing theoretieal properties, sueh as invarianee 
to eertain reparameterizations of the network. Finally, we showed that K-FAC, when eombined 
with a form of momentum and an inereasing sehedule for the mini-bateh size m, far surpasses the 
performanee of a well-tuned version of SGD with momentum on diffleult deep auto-eneoder opti¬ 
mization benehmarks (in the setting of a single GPU maehine). Moreover, our results demonstrated 
that K-FAC requires orders of magnitude fewer total updates/iterations than SGD with momentum, 
making it ideally suited for a massively distributed implementation where synehronization is the 
main bottleneek. 

Some potential direetions for future development of K-FAC inelude: 

• a better/more-prineipled handling of the issue of gradient stoehastieity than a pre-determined 
inereasing sehedule for m 

• extensions of K-FAC to reeurrent or eonvolutional arehiteetures, whieh may require speeial- 
ized approximations of their assoeiated Fisher matriees 

• an implementation that better exploits opportunities for parallelism deseribed in Seetionj^ 

• exploitation of massively distributed eomputation in order to eompute high-quality estimates 
of the gradient 
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A Derivation of the expression for the approximation from Sec¬ 


tion 3.1 


In this section we will show that 

E -E E [g^^^g^^^] 

= K{a^^\ a^‘^\ g^^\ + K{a!'^^)K{aP\ g^^\ + K{a^‘^^)K{a!'^\ g^^\ 

The only specific property of the distribution over 0 *-^^ a^‘^\ g^^\ and which we will 
require to do this is captured by the following lemma. 

Lemma 4. Suppose u is a scalar variable which is independent of y when conditioned on the 
network’s output f{x, 6), and v is some intermediate quantity computed during the evaluation of 
f{x, 9) (such as the activities of the units in some layer). Then we have 

E [u Vv] = 0 

Our proof of this lemma (which is at the end of this section) makes use of the fact that the 
expectations are taken with respect to the network’s predictive distribution Py\x as opposed to the 
training distribution Qy\x- 

Intuitively, this lemma says that the intermediate quantities computed in the forward pass of 
Algorithm (or various functions of these) are statistically uncorrelated with various derivative 
quantities computed in the backwards pass, provided that the targets y are sampled according to 
the network’s predictive distribution Py\x (instead of coming from the training set). Valid choices 
for u include — E for k G {1,2}, and products of these. Examples of invalid 

choices for u include expressions involving g^^\ since these will depend on the derivative of the 
loss, which is not independent of y given /(x, 9). 

According to a well-known general formula relating moments to cumulants we may write 
E as a sum of 15 terms, each of which is a product of various cumulants corre¬ 

sponding to one of the 15 possible ways to partition the elements of {d*^^\ aP‘\ g^^\ into non¬ 
overlapping sets. For example, the term corresponding to the partition {{d^^^}, g^^\ is 

Observing that Ist-order cumulants correspond to means and 2nd-order cumulants correspond 
to covariances, for fc G (1, 2} Lemmagives 

K{g^^^) = E = E [Vx^^^] = 0 

where = [xi]k 2 , and x^^^ = (so that g^’^^ = Vx^^^). And similarly for k,m & {1,2} it 
gives 

K{d^’‘\g^^^) = E [(d^”*) - E [d(™)]) - E = E [(d^™) - E [d^”*)]) g^'‘^] = 0 
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Using these identities we ean eliminate 10 of the terms. 

The remaining expression for E is thus 

K{a^^\ a^‘^\ + K{a!'^'^)K{aP‘\ g^^\ g^^\ 


Noting that 

= Cov(d('\ d(2)) E ^ E ^-(1)] ^-(2)] ^^(1)^(2)] ^ E ^-(l)-(2)] ^^(1)^(2)] 

it thus follows that 

E [d(i)d(2) ^(1)^(2)] _E [d(')d(2)] E [g^^^g^^^] 

as required. 

It remains to prove Lemma |^ 

Proof of Lemma^ The ehain rule gives 


Vv = 


dlogp(?/|a;, 0) d\ogr{ii\z) 


dv 


dz 


z=f{x,d) 


d/(x,g) 

dv 


From whieh it follows that 


E [u Vv] = Eq^ Ep^i^ [u Vv] 


= E 


Qx 


[uVv] 


= E 


Qx 


E 


R 


v\f(x,e) 


-U 


dlogr(|/|^) 


v\S(x,e) 
T 


= E 


Qx 


—uE 


Ry\f{x,B) 


dz 

dlogr(|/|z) 




z=f(x,e) 




d/(x,g) 

dn 


d/(a^,^) 

du 


= E 


Qx 


-mO' 


d/(x,g) 

dw 


= 0 


That the inner expeetation above is 0 follows from the faet that the expeeted seore of a distri¬ 
bution, when taken with respeet to that distribution, is 0. 

□ 
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B Efficient techniques for inverting B ^ C ^ D 


It is well known that {A ® B)~^ = A~^ ® and that matrix-vector products with this matrix 
can thus be computed as (^4“^ ® B~^)v = Yec{B~^V A~^), where V is the matrix representation 
of V (so that V = vec(V)). 

Somewhat less well known is that there are also formulas for (A 0 5 ± C 0 D)~^ which can 
be efficiently computed and likewise give rise to efficient methods for computing matrix-vector 
products. 


First, note that (A ^ B ± C ^ D)~^v = u is equivalent to (A 0 i? ± C ® D)u = v, which is 
equivalent to the linear matrix equation BBA~’'±D [/= V, where u = vec{U) and v = vec(l/). 
This is known as a generalized Stein equation, and different examples of it have been studied in 
the control theory literature, where they have numerous applications. For a recent survey of this 
topic, see|Simoncini||(|201^. 


One well-known class of methods called Smith-type iterations (Smith 19681 involve rewriting 
this matrix equation as a fixed point iteration and then carrying out this iteration to convergence. 
Interestingly, through the use of a special squaring trick, one can simulate 2^ of these iterations 
with only 0{j) matrix-matrix multiplications. 


Another class of methods for solving Stein equations involves the use of matrix decomposi¬ 
tions (e.g. Chul 1987; Gardiner et al.[ 1992|). Here we will present such a method particularly well 


suited for our application, as it produces a formula for (A ® S -f G ® D)~^v, which after a fixed 
overhead cost (involving the computation of SVDs and matrix square roots), can be repeatedly 
evaluated for different choices of v using only a few matrix-matrix multiplications. 


We will assume that A, B,C, and D are symmetric positive semi-definite, as they always are 
in our applications. We have 




Inverting both sides of the above equation gives 

(A 0 5 ± G ® DA = ® B-^Ai.1 ® ^ ± A-^/^GA-^/^ ^ ^ ^-1/2^ 


Using the symmetric eigen/SVD-decomposition, we can write A~^I‘^CA~^l‘^ = EiSiEj and 
5 -i/ 2 £)^-i /2 _ E 2 S 2 EJ, where for i G {1, 2} the Si are diagonal matrices and the Ei are unitary 
matrices. 

This gives 

/ 0 / ± A-^/2^A-i/2 ^ ^-i/2^^-i/2 = J (g, J ± EiSiEJ 0 E 2 S 2 EJ 

= EiEj 0 E2EJ ± E.SiEj 0 E2S2EJ 
= (Gi 0 G2)(/ 0 / ± 0 S2){EJ 0 Ej ) 
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so that 

(J ® J d= ® = (El 0 E2)(/ 0 / ± 0 S2)~^{EJ 0 EJ) 

Note that both J0J and S'i05'2 are diagonal matrices, and thus the middle matrix (J0/±S'i05'2)“^ 
is just the inverse of a diagonal matrix, and so can be computed efficiently. 

Thus we have 

{A®B±C®D)-^ = {A-E^ 0 B-^/^){Ei 0 E2)(/ 0 / ± 0 ^ 2 )"^ (^7 ® E'l){A-^/'^ 0 B-^/^) 

= (El 0 E2)(/ 0 / ± ^1 0 S2)-\KJ 0 Ej) 

where Ei = A~^^‘^Ei and E 2 = B~^^‘^E 2 . 

And so matrix-vector products with (A 0 E ± E 0 can be computed as 

(A 0 E ± E 0 D)-\ = vec (E 2 [(EJV^Ei) 0 (ll^ ± sas^)] KJ) 

where E (Z) E denotes element-wise division of E by E, Si = diag(S'j), and 1 is the vector of 
ones (sized as appropriate). Note that if we wish to compute multiple matrix-vector products with 
(A 0 E ± E 0 D)~^ (as we will in our application), the quantities Ei, E 2 , Si and S 2 only need to 
be computed the first time, thus reducing the cost of any future such matrix-vector products, and 
in particular avoiding any additional SVD computations. 

In the considerably simpler case where A and E are both scalar multiples of the identity, and 
^ is the product of these multiples, we have 

(e/0/±E0E)-i _ (Ei0E2)(e/0/±^i0^2)“^(^7 <»Ej) 

where Ei^iE^ and E2S2EJ are the symmetric eigen/SVD-decompositions of C and D, respec¬ 
tively. And so matrix-vector products with 0 / ± E 0 E)“^ can be computed as 

(^J 0 J ± E 0 D)-^v = vec (E 2 [(EjEEi) 0 ± S2s7)] EJ) 


C Computing Fv and Fv more efficiently 


Note that the Fisher is given by 


F = Eg. [JVhJ] 

where J is the Jacobian of f{x, 9) and Er is the Fisher information matrix of the network’s pre¬ 
dictive distribution E^i^, evaluated dXz = /(x, 9) (where we treat 2 ; as the “parameters”). 


To compute the matrix-vector product En as estimated from a mini-batch we simply compute 
ErJv for each x in the mini-batch, and average the results. This latter operation can be com¬ 


puted in 3 stages (e.g. Martens 20141, which correspond to multiplication of the vector v first by 
J, then by Er, and then by F. 
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Multiplication by J can be performed by a forward pass which is like a linearized version 
of the standard forward pass of Algorithm As Fr is usually diagonal or diagonal plus rank- 
1, matrix-vector multiplications with it are cheap and easy. Finally, multiplication by can 
be performed by a backwards pass which is essentially the same as that of Algorithm See 


Schraudolph (2002); Martens (2014) for further details. 


The naive way of computing Fv is to compute Fv as above, and then compute the in¬ 
ner product of Fn with v. Additionally computing vJFv and Fu would require another such 
matrix-vector product Fu. 


However, if we instead just compute the matrix-vector products Jv (which requires only half 
the work of computing Fv), then computing v'^ Fv as {Jv)~^ Fr{Jv) is essentially free. And with 
Ju computed, we can similarly obtain Fv as {Ju)~^F r{Jv) and uJFu as {Ju)^F r{Ju). 


This trick thus reduces the computational cost associated with computing these various scalars 
by roughly half. 


D Proofs for Section [T0| 


Proof of Theorem^ First we will show that the given network transformation can be viewed as 
reparameterization of the network according to an invertible linear function C^. 

Define 01 = [vec(fF/)'''vec(fFj)'''... vec(lF'/where W} = (so that Wi = 

and let C, be the function which maps 6*^ to 9. Clearly C, is an invertible linear transfor¬ 
mation. 

If the transformed network uses 6*^ in place of 6 then we have 

a\ = fljdj and sj = 


which we can prove by a simple induction. First note that Ug = by definition. Then, assuming 
by induction that = Di_iaj_i, we have 






and therefore also 


a\ = ni0i($i4) = = ^ifiisi) = VtiQi 


And because Qi = I, we have aj = ai, or more simply that aj = a^, and thus both the 
original network and the transformed one have the same output (i.e. f{x, 9) = /^(x, 9^)). From 
this it follows that p{x, 9'^) = f{x, 9) = f{x, C(^^))> ^^d thus the transformed network can be 
viewed as a reparameterization of the original network by 9f Similarly we have that /if (6*^) = 

m = A(c(9*)). 
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The following lemma is adapted from Martens (2014) (see the seetion titled “A eritieal anal¬ 
ysis of parameterization invarianee”). 


Lemma 5. Let ( be some invertible affine function mapping 01 to 9, which reparameterizes the 
objective h{9) as h{({9'')). Suppose that B and are invertible matrices satisfying 


jjBJi^ = B^ 

Then, additively updating 9 by 6 = —aB~^Vh is equivalent to additively updating 0l by 6^ = 
—aB'^~^S7o^h{((9'^)), in the sense that ({9^ + ^1) = 0 + 

Beeause 0,1(01) = 0(0) = 0(C(0l)) we have that V0l = 'Vgth{({9^)). So,_by the above 
lemma, to prove the theorem it suffiees to show that Jj F = Fl and JjF Ji^ = Ff 

Using Wi = ^iW}VLi_i it is straightforward to verify that 

= diag(Uo 0 <f)i, , ^J-i <8 ^e) 


Beeause Sj = ^jsj and the faet that the networks eompute the same outputs (so the loss 
derivatives are identieal), we have by the ehain rule that, gj = Vs\ = ^JVsi = Qi, and 
therefore 



E 




E [4>7a(>I>79.)’"] = <*7 E [9 .s 7] l-i = -iJOiA 


Furthermore, 



E 




E [(Ujai)(%aj)"^] 


UiE [ttioj] Uj = LtiAijOf- 


Using these results we may express the Kroneeker-faetored bloeks of the approximate Fisher 
as eomputed using the transformed network, as follows: 


F}j — ® G\j — 




.<j) . 
,3 ^3 


(a-1 0 4'7)(4-i,i-i ® 0 $,) 

(a_i 0 $7 8) $j) 


Given this identity we thus have 

Fl = diag(Fli,%, 

= diag (^(Uo ® $7)A,i(f^7 ® (^1 ® $ 2 ), • •., ® ^J)Fey{nJ_^ ® <F^)) 

= diag(Uo $7, ® • • •, ^£-1 ® $7) diag Fa, 2 , • ■ •, , F^,^) 

■ diag(U7 (8) $ 1 , U7 ® $ 2 , • • •, ^7-i ® 

= JJFJ^ 
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We now turn our attention to the F (see Seetion 
First note that 


4.3 


for the relevant notation). 


tr+i = = (a-i ® )Fi,i+i(nj ® fi+o ((a ® 0 4i+i)) 

= 0 47 )f.,+,(sj7 0 4.+i)(sj7 0 0 47 +,)-' 

= (a_i 0 0 $7+i)“' 


and so 


y;i = pi — \T/'^ vT/i''^ 

^ i+l,i+l^ i,i+l 

= (a_i 0 0 

- (n,_i 0 $ 7 ) 4 'i,i+i(a 0 $ 7 +i)-i(a 0 $ 7 +i)Fi+i,i+i(fi 7 0 $i+i)(a 0 $ 7 +i)"^ 

= (fii_i 0 $ 7 )Si|,+i(fi 7 _i 0 


Also, sj = fI^ = 0 $7)F^,f(f]7_i 0 <h^) = 0 <h7)S^(f]7_i 0 $£). 

From these faets it follows that 

= diag (si| 2 , S^ 2 | 3 , ..., E{_ 3 |^, eJ) 

= diag ((fio ® d' 7 )Si| 2 (^^o ® *^>7), (^1 ® d>I)S 2 | 3 (f^i 0 $7), ..., 

(fi £_2 0 $7_i)Ef_i|^(fif_2 0 $7_i), 0 $7 )sk^7-i ® 

= diag(f]o ® *^*7) ® • • •) ^ 1-2 ® '^7-i> ® *^7) diag (Ei| 2 , E 213 , • • •, ^e) 

diag(fl7 ® ^^7 ® ^> 2 , ..., fl7-2 ® *^^- 1 ) ^7-1 ® d)^) 


= jjA-^J^ 


Inverting both sides gives 
Next, observe that 

'rAi(ii7-i®'S><)-‘ = (ft0 47„)-T47_,i(ji._i0 47)T(fi7_i0 4.)-‘ 


= (fi7®4i+i)“"'i'7i+i 
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from which it follows that 


■tv- = 


tT 




I 


-^1-1. tTj 


diag((no (g) $i) \ ..., (n|_i 0 <hi!) 


0 $i)-i (^7 (g) $ 2 )“^ 


(g) $ 2 )“^ (^^7 g) $ 3 )“^ 


® i^J-i g ^e)~\ 


(^7 

-(^7 ^$i)-id'72 


(^7 ® *^’ 2 )“^ 

■(f]7 ® $2)"^^73 ® ‘^ 3 )“^ 


{ nj _2 0 ( fi 7 _i 0 $£)-7 


= diag((f]7 ® *^i) ^ (^7 ® *^ 2 ) 7 • • •, (^7-1 ® ^e) 7 


r-i^T 

- 


-^7 


-^gs I 




Combining A and ^ we have 


C “ '"C 


Ft-i = J-^)A{E^^ J-y = (J-^S^)A( 


T 


= yE'^AEjy 


Inverting both sides gives F'^ = Jj F J(^ as required. 


□ 


Proof of Corollary^ First note that a network whieh is transformed so that GI^ = I and A\^ = I 
will satisfy the required properties. To see this, note that Flglgf'] = G\^ = I means that gj is 
whitened with respeet to the model’s distribution by definition (sinee the expeetation is taken with 
respeet to the model’s distribution), and furthermore we have that F[g\] = 0 by default (e.g. using 


Lemma 


4), so g\ is eentered. And sinee E[a|^aJ' ] is the square submatrix of A\^ = I whieh leaves 
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out the last row and column, we also have that E[aja|^] = / and so a\ is whitened. Finally, observe 
that E[aJ] is given by the final column (or row) of Ai^i, excluding the last entry, and is thus equal 
to 0, and so a\ is centered. 

Next, we note that if ^ = / and = I then 

= diag 0 ® ® ..., A\_^^i^_^ ® = diag (/ ® J, / ® /, ..., J ® J) = J 


and so —aF = —a'Vh"' is indeed a standard gradient descent update. 


Finally, we observe that there are choices of fl* and which will make the transformed 

ft - - . - - - - - r-i 


model satisfy G]^ = I and A]^ = F In particular, from the proof of Theorem ^we have 


that 


GF = (Fj Gij^j and AA = , and so taking <!)* = and works. 

The result now follows from Theorem [H 


- 1/2 


- 1/2 


□ 
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